Abstract: Lower bounds involving $f$-divergences between the underlying probability measures are proved for the minimax risk in estimation problems. Our proofs use only simple convexity facts. Special cases and straightforward corollaries of our bounds include well-known inequalities for establishing minimax lower bounds such as Fano's inequality, Pinsker's inequality and inequalities based on global entropy conditions. Two applications are provided: a new minimax lower bound for the reconstruction of convex bodies from noisy support function measurements and a different proof of a recent minimax lower bound for the estimation of a covariance matrix.

Index Terms: $f$-divergences; Fano's inequality; Minimax lower bounds; Pinsker's inequality; Reconstruction from support functions.



I. Introduction 

Consider an estimation problem in which we want to estimate $\theta \in \Theta$ based on an observation $X$ from $\{P_\theta, \theta \in \Theta\}$, where each $P_\theta$ is a probability measure on a sample space $\mathcal{X}$. Suppose that estimators are allowed to take values in $\mathcal{A} \supseteq \Theta$ and that the loss function is of the form $\ell(\rho)$, where $\rho$ is a metric on $\mathcal{A}$ and $\ell : [0, \infty) \to [0, \infty)$ is a nondecreasing function. The minimax risk for this problem is defined by

$$ R := \inf_{\hat\theta} \sup_{\theta \in \Theta} E_\theta\, \ell\big(\rho(\theta, \hat\theta(X))\big), $$

where the infimum is over all measurable functions $\hat\theta : \mathcal{X} \to \mathcal{A}$ and the expectation is taken under the assumption that $X$ is distributed according to $P_\theta$.

In this article, we are concerned with the problem of obtaining lower bounds for the minimax risk $R$. Such bounds are useful in assessing the quality of estimators for $\theta$. The standard approach to these bounds is to obtain a reduction to the more tractable problem of bounding from below the minimax risk of a multiple hypothesis testing problem. More specifically, one considers a finite subset $F$ of the parameter space $\Theta$ and a real number $\eta$ such that $\rho(\theta, \theta') \geq \eta$ for $\theta, \theta' \in F$, $\theta \neq \theta'$, and employs the inequality $R \geq \ell(\eta/2)\, r$, where

$$ r := \inf_{T} \sup_{\theta \in F} P_\theta\{T \neq \theta\}, \tag{1} $$

the infimum being over all estimators $T$ taking values in $F$. The proof of this inequality relies on the triangle inequality satisfied by the metric $\rho$ and can be found, for example, in [1, Page 1570, Proof of Theorem 1]. (Let us note, for the convenience of the reader, that the notation employed by Yang and Barron [1] differs from ours in that they use $d$ for the metric $\rho$, $\epsilon_{n,d}$ for our $\eta$ and $N_{\epsilon_{n,d}}$ for the finite set $F$. Also the proof in [1] involves a positive constant $A$ which can be taken to be 1 for our purposes. The constant $A$ arises because Yang and Barron [1] do not require that $d$ is a metric but rather require it to satisfy a weaker local triangle inequality which involves the constant $A$.)

[...] out that Theorem II.1 appears in his paper [14]. Specifically, in a different notation, inequality (3) appears as Theorem 1 and inequality (4) appears in Section 4.3 in [14]. The proof of Theorem II.1 presented in section II is different from that in [14]. Also, except for Theorem II.1 and the observation that Fano's inequality is a special case of Theorem II.1 (see Example II.4), there is no other overlap between this paper and [14].

[...]versity, 24 Hillhouse Avenue, New Haven, CT 06511, USA. e-mail: adityanand.guntuboyina@yale.edu

The next step is to note that $r$ is bounded from below by Bayes risks. Let $w$ be a probability measure on $F$. The Bayes risk $\bar r_w$ corresponding to the prior $w$ is defined by

$$ \bar r_w := \inf_{T} \sum_{\theta \in F} w_\theta P_\theta\{T \neq \theta\}, \tag{2} $$

where $w_\theta := w\{\theta\}$ and the infimum is over all estimators $T$ taking values in $F$. When $w$ is the discrete uniform probability measure on $F$, we simply write $\bar r$ for $\bar r_w$. The trivial inequality $r \geq \bar r_w$ implies that lower bounds for $\bar r_w$ are automatically lower bounds for $r$.

The starting point for the results described in this paper is Theorem II.1, which provides a lower bound for $\bar r_w$ involving $f$-divergences of the probability measures $P_\theta, \theta \in F$. The $f$-divergences ([2]-[5]) are a general class of divergences between probability measures which include many common divergences and distances such as the Kullback-Leibler divergence, chi-squared divergence, total variation distance and Hellinger distance. For a convex function $f : [0, \infty) \to \mathbb{R}$ satisfying $f(1) = 0$, the $f$-divergence between two probabilities $P$ and $Q$ is given by

$$ D_f(P\|Q) := \int f\left(\frac{dP}{dQ}\right) dQ $$

if $P$ is absolutely continuous with respect to $Q$ and $\infty$ otherwise.
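To make the definition concrete on a finite sample space, the following Python sketch evaluates $D_f(P\|Q) = \sum_x q(x) f(p(x)/q(x))$ for probability vectors. The function names (`f_divergence`, `kl_f`, `chi2_f`, `tv_f`) and the example vectors are illustrative assumptions and are not taken from the paper; only the convention that the divergence is infinite when $P$ is not absolutely continuous with respect to $Q$ comes from the definition above.

```python
import math

def f_divergence(f, p, q):
    """D_f(P||Q) = sum_x q(x) * f(p(x)/q(x)) for probability vectors p, q.

    Follows the convention in the text: the value is +inf when P is not
    absolutely continuous with respect to Q.
    """
    total = 0.0
    for px, qx in zip(p, q):
        if qx == 0.0:
            if px > 0.0:
                return math.inf
            continue  # points with q(x) = p(x) = 0 contribute nothing
        total += qx * f(px / qx)
    return total

# Convex functions with f(1) = 0 (names are illustrative).
def kl_f(x):
    return x * math.log(x) if x > 0.0 else 0.0  # x log x, with 0 log 0 = 0

def chi2_f(x):
    return x * x - 1.0

def tv_f(x):
    return abs(x - 1.0) / 2.0

p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]
kl_pq = f_divergence(kl_f, p, q)      # Kullback-Leibler divergence D(P||Q)
chi2_pq = f_divergence(chi2_f, p, q)  # chi-squared divergence
tv_pq = f_divergence(tv_f, p, q)      # total variation distance
```

For these vectors the total variation value agrees with half the $L^1$ distance between `p` and `q`, and the chi-squared value agrees with $\sum_x p(x)^2/q(x) - 1$.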

Our proof of Theorem II.1, presented in section II, is extremely simple. It just relies on the convexity of the function $f$ and the standard result that $\bar r_w$ has the following exact expression:

$$ \bar r_w = 1 - \int_{\mathcal{X}} \max_{\theta \in F} \{w_\theta p_\theta(x)\}\, d\mu(x), \tag{3} $$

where $p_\theta$ denotes the density of $P_\theta$ with respect to a common dominating measure $\mu$ (for example, one can take $\mu := \sum_{\theta \in F} P_\theta$).

We show that Fano's inequality is a special case (see Example II.4) of Theorem II.1, obtained by taking $f(x) = x \log x$. Fano's inequality is used extensively in the nonparametric statistics literature for obtaining minimax lower bounds, important works being [1], [6]-[11]. In the special case when $F$ has only two points, Theorem II.1 gives a sharp inequality relating the total variation distance between two probability measures to $f$-divergences (see Corollary II.3). When $f(x) = x \log x$, Corollary II.3 implies an inequality due to Topsøe [12] from which Pinsker's inequality can be derived. Thus Theorem II.1 can be viewed as a generalization of both Fano's inequality and Pinsker's inequality.

The bound given by Theorem II.1 involves the quantity $J_f := \inf_Q \sum_{\theta \in F} D_f(P_\theta\|Q)/|F|$, where the infimum is over all probability measures $Q$ and $|F|$ denotes the cardinality of the finite set $F$. It is usually not possible to calculate $J_f$ exactly, and in section III we provide upper bounds for $J_f$. The main result of that section, Theorem III.1, provides an upper bound for $J_f$ based on approximating the set of $|F|$ probability measures $\{P_\theta, \theta \in F\}$ by a smaller set of probability measures. This result is motivated by, and is a generalization to $f$-divergences of, a result of Yang and Barron [1] for the Kullback-Leibler divergence.

In section IV, we use the inequalities proved in sections II and III to obtain minimax lower bounds involving only global metric entropy attributes. Of all the lower bounds presented in this paper, Theorem IV.1, the main result of section IV, is the most application-ready method. In order to apply it in a particular situation, one only needs to determine suitable bounds on global covering and packing numbers of the parameter space $\Theta$ and the space of probability measures $\{P_\theta, \theta \in \Theta\}$ (see section V for an application).

Although the main results of sections II and III hold true for all $f$-divergences, Theorem IV.1 is stated only for the Kullback-Leibler divergence, the chi-squared divergence and the divergences based on $f(x) = x^l - 1$ for $l > 1$. The reason is that Theorem IV.1 is intended for applications where the underlying probability measures $P_\theta$ are usually product measures, and divergences such as the Kullback-Leibler divergence and the chi-squared divergence can be computed for product probability measures.

The inequalities given by Theorem IV.1 for the chi-squared divergence and the divergences based on $f(x) = x^l - 1$ for $l > 1$ are new, while the inequality for the Kullback-Leibler divergence is due to Yang and Barron [1]. There turn out to be qualitative differences between these inequalities in the case of estimation problems involving finite dimensional parameters: the inequality based on the chi-squared divergence gives minimax lower bounds having the optimal rate, while the one based on the Kullback-Leibler divergence only results in sub-optimal lower bounds. We shall explain this phenomenon in section IV by means of elementary examples.

We shall present two applications of our bounds. In section V, we shall prove a new lower bound for the minimax risk in the problem of estimation/reconstruction of a $d$-dimensional convex body from noisy measurements of its support function in $n$ directions. In section VI, we shall provide a different proof of a recent result by Cai, Zhang and Zhou [13] on covariance matrix estimation.

II. Lower bounds for the testing risk $\bar r_w$

We shall prove a lower bound for $\bar r_w$, defined in (2), in terms of $f$-divergences. We shall assume that the $N := |F|$ probability measures $P_\theta, \theta \in F$ are all dominated by a sigma finite measure $\mu$ with densities $p_\theta, \theta \in F$. In terms of these densities, $\bar r_w$ has the exact expression given in (3). A trivial consequence of (3) that we shall often use in the sequel is that $\bar r \leq 1 - 1/N$ (recall that $\bar r$ is $\bar r_w$ in the case when $w$ is the uniform probability measure on $F$).
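On a finite sample space with counting measure as $\mu$, the exact expression (3) becomes a finite sum, $\bar r_w = 1 - \sum_x \max_\theta w_\theta p_\theta(x)$. The short Python sketch below evaluates it; the function name `bayes_risk` and the example distributions are hypothetical and serve only to illustrate the bound $\bar r \leq 1 - 1/N$ and its two extreme cases.

```python
def bayes_risk(w, dists):
    """Testing Bayes risk via expression (3) on a finite sample space:
    r_w = 1 - sum_x max_theta w_theta * p_theta(x)."""
    n_points = len(dists[0])
    return 1.0 - sum(
        max(wt * p[x] for wt, p in zip(w, dists)) for x in range(n_points)
    )

uniform = [0.5, 0.5]  # uniform prior on N = 2 hypotheses
risk_equal = bayes_risk(uniform, [[0.5, 0.5], [0.5, 0.5]])     # identical measures
risk_singular = bayes_risk(uniform, [[1.0, 0.0], [0.0, 1.0]])  # mutually singular
risk_mixed = bayes_risk(uniform, [[0.7, 0.3], [0.4, 0.6]])     # in between
```

Identical measures attain the maximum value $1 - 1/N$, mutually singular measures give risk zero, and any other pair falls in between.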

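Theorem II.1. Let $w$ be a probability measure on $F$. Define $T : \mathcal{X} \to F$ by $T(x) := \arg\max_{\theta \in F} \{w_\theta p_\theta(x)\}$, where $w_\theta := w\{\theta\}$. For every convex function $f : [0,\infty) \to \mathbb{R}$ and every probability measure $Q$ on $\mathcal{X}$, we have

$$ \sum_{\theta \in F} w_\theta D_f(P_\theta\|Q) \geq W f\left(\frac{1 - \bar r_w}{W}\right) + (1 - W) f\left(\frac{\bar r_w}{1 - W}\right), \tag{4} $$

where $W := \int_{\mathcal{X}} w_{T(x)}\, dQ(x)$. In particular, taking $w$ to be the uniform probability measure, we get that

$$ \sum_{\theta \in F} D_f(P_\theta\|Q) \geq f\big(N(1 - \bar r)\big) + (N - 1) f\left(\frac{N \bar r}{N - 1}\right). \tag{5} $$

The proof of this theorem relies on a simple application of the convexity of $f$ and is presented below.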

Proof: We may assume that all the weights $w_\theta$ are strictly positive and that the probability measure $Q$ has a density $q$ with respect to $\mu$. We start with a simple inequality for nonnegative numbers $a_\theta, \theta \in F$ with $\tau := \arg\max_{\theta \in F} \{w_\theta a_\theta\}$. We first write

$$ \sum_{\theta \in F} w_\theta f(a_\theta) = w_\tau f(a_\tau) + (1 - w_\tau) \sum_{\theta \neq \tau} \frac{w_\theta}{1 - w_\tau} f(a_\theta) $$

and then use the convexity of $f$ to obtain that the quantity $\sum_\theta w_\theta f(a_\theta)$ is bounded from below by

$$ w_\tau f(a_\tau) + (1 - w_\tau) f\left(\frac{\sum_{\theta \neq \tau} w_\theta a_\theta}{1 - w_\tau}\right). $$

We now fix $x \in \mathcal{X}$ such that $q(x) > 0$ and apply the inequality just derived to $a_\theta := p_\theta(x)/q(x)$. Note that in this case $\tau = T(x)$. We get that

$$ \sum_{\theta \in F} w_\theta f\left(\frac{p_\theta(x)}{q(x)}\right) \geq A(x) + B(x), \tag{6} $$

where

$$ A(x) := w_{T(x)} f\left(\frac{p_{T(x)}(x)}{q(x)}\right) $$

and

$$ B(x) := (1 - w_{T(x)}) f\left(\frac{\sum_{\theta \in F} w_\theta p_\theta(x) - w_{T(x)} p_{T(x)}(x)}{(1 - w_{T(x)})\, q(x)}\right). $$

Integrating inequality (6) with respect to the probability measure $Q$, we get that the term $\sum_{\theta \in F} w_\theta D_f(P_\theta\|Q)$ is bounded from below by

$$ \int_{\mathcal{X}} A(x) q(x)\, d\mu(x) + \int_{\mathcal{X}} B(x) q(x)\, d\mu(x). $$

Let $Q'$ be the probability measure on $\mathcal{X}$ having the density $q'(x) := w_{T(x)} q(x)/W$ with respect to $\mu$. Clearly

$$ \int_{\mathcal{X}} A(x) q(x)\, d\mu(x) = W \int_{\mathcal{X}} f\left(\frac{p_{T(x)}(x)}{q(x)}\right) q'(x)\, d\mu(x), $$

which, by Jensen's inequality, is larger than or equal to $W f((1 - \bar r_w)/W)$. It follows similarly that

$$ \int_{\mathcal{X}} B(x) q(x)\, d\mu(x) \geq (1 - W) f\left(\frac{\bar r_w}{1 - W}\right). $$

This completes the proof of inequality (4). When $w$ is the uniform probability measure on the finite set $F$, it is obvious that $W$ equals $1/N$, and this leads to inequality (5). ■
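As a numerical sanity check of inequality (5), the sketch below draws random probability vectors on a small finite sample space and compares $\sum_\theta D_f(P_\theta\|Q)$ with $f(N(1-\bar r)) + (N-1) f(N\bar r/(N-1))$ for $f(x) = x\log x$ and $f(x) = x^2 - 1$. This is only an illustration under stated assumptions: all function names, the random number seed and the sampling scheme are hypothetical choices made here, and $Q$ is taken strictly positive for simplicity.

```python
import math
import random

def f_div(f, p, q):
    # D_f(P||Q) on a finite space; q is assumed strictly positive here.
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

def bayes_risk_uniform(dists):
    # Expression (3) with uniform weights w_theta = 1/N.
    n = len(dists)
    return 1.0 - sum(max(col) for col in zip(*dists)) / n

def rhs_of_5(f, n, r):
    # f(N(1 - r)) + (N - 1) f(N r / (N - 1))
    return f(n * (1.0 - r)) + (n - 1) * f(n * r / (n - 1))

def kl_f(x):
    return x * math.log(x) if x > 0.0 else 0.0

def chi2_f(x):
    return x * x - 1.0

def random_dist(rng, k):
    # Strictly positive random probability vector of length k.
    weights = [rng.random() + 0.05 for _ in range(k)]
    total = sum(weights)
    return [v / total for v in weights]

rng = random.Random(0)
gaps = []
for _ in range(200):
    n, k = rng.randint(2, 6), rng.randint(2, 8)
    dists = [random_dist(rng, k) for _ in range(n)]
    q = random_dist(rng, k)  # an arbitrary reference measure Q
    r_bar = bayes_risk_uniform(dists)
    for f in (kl_f, chi2_f):
        lhs = sum(f_div(f, p, q) for p in dists)
        gaps.append(lhs - rhs_of_5(f, n, r_bar))

min_gap = min(gaps)  # left side minus right side of (5); should be nonnegative
```

Up to floating point error, `min_gap` should be nonnegative, in line with the theorem.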
Let us denote the function of $\bar r$ on the right hand side of (5) by $g$:

$$ g(a) := f\big(N(1 - a)\big) + (N - 1) f\left(\frac{N a}{N - 1}\right). \tag{7} $$

Inequality (5) provides an implicit lower bound for $\bar r$. This is because $\bar r \in [0, 1 - 1/N]$ and $g$ is non-increasing on $[0, 1 - 1/N]$ (as can be seen in the proof of the next corollary in the case when $f$ is differentiable; if $f$ is not differentiable, one needs to work with right and left derivatives, which exist for convex functions).

The convexity of $f$ also implies trivially that $g$ is convex, which can be used to convert the implicit bound (5) into an explicit lower bound. This is the content of the following corollary. We assume differentiability for convenience, to avoid working with one-sided derivatives.

Corollary II.2. Suppose that $f : [0,\infty) \to \mathbb{R}$ is a differentiable convex function and that $g$ is defined as in (7). Then, for every $a \in [0, 1 - 1/N]$, we have

$$ r \geq \bar r \geq a + \frac{\inf_Q \sum_{\theta \in F} D_f(P_\theta\|Q) - g(a)}{g'(a)}, \tag{8} $$

where the infimum is over all probability measures $Q$.

Proof: Fix a probability measure $Q$. Inequality (5) says that $\sum_{\theta \in F} D_f(P_\theta\|Q) \geq g(\bar r)$. The convexity of $f$ implies that $g$ is also convex and hence, for every $a \in [0, 1 - 1/N]$, we can write

$$ \sum_{\theta \in F} D_f(P_\theta\|Q) \geq g(\bar r) \geq g(a) + g'(a)(\bar r - a). \tag{9} $$

Also,

$$ \frac{g'(a)}{N} = f'\left(\frac{N a}{N - 1}\right) - f'\big(N(1 - a)\big). $$

Because $g$ is convex, we have $g'(a) \leq g'(1 - 1/N) = 0$ for $a \leq 1 - 1/N$ (this proves that $g$ is non-increasing on $[0, 1 - 1/N]$). Therefore, by rearranging (9), we obtain (8). ■
Let us now provide an intuitive understanding of inequality (5). When the probability measures $P_\theta, \theta \in F$ are tightly packed, i.e., when they are close to one another, it is hard to distinguish between them (based on the observation $X$) and hence the testing Bayes risk $\bar r$ will be large. On the other hand, when the probability measures are well spread out, it is easy to distinguish between them and therefore $\bar r$ will be small. Indeed, $\bar r$ takes on its maximum value of $1 - 1/N$ when the probability measures $P_\theta, \theta \in F$ are all equal to one another, and it takes on its smallest value of $0$ when $\max_\theta p_\theta = \sum_\theta p_\theta$, i.e., when $P_\theta, \theta \in F$ are all mutually singular.

Now, one way of measuring how packed or spread out the probability measures $P_\theta, \theta \in F$ are is to consider the quantity $\inf_Q \sum_{\theta \in F} D_f(P_\theta\|Q)$, which is small when the probabilities are tightly packed and large when they are spread out. It is therefore reasonable to expect a connection between this quantity and $\bar r$. Inequality (5) makes this connection explicit and precise. The fact that the function $g$ in (7) is non-increasing means that when $\inf_Q \sum_{\theta \in F} D_f(P_\theta\|Q)$ is small, the lower bound on $\bar r$ implied by (5) is large, and when $\inf_Q \sum_{\theta \in F} D_f(P_\theta\|Q)$ is large, the lower bound on $\bar r$ is small.

Theorem II.1 implies the following corollary, which provides sharp inequalities between total variation distance and $f$-divergences. The total variation distance between two probability measures is defined as half the $L^1$ distance between their densities.

Corollary II.3. Let $P_1$ and $P_2$ be two probability measures on a space $\mathcal{X}$ with total variation distance $V$. For every convex function $f : [0, \infty) \to \mathbb{R}$, we have

$$ \inf_Q \big(D_f(P_1\|Q) + D_f(P_2\|Q)\big) \geq f(1 + V) + f(1 - V), \tag{10} $$

where the infimum is over all probability measures $Q$. Moreover, this inequality is sharp in the sense that for every $V \in [0, 1]$, the infimum of the left hand side of (10) over all probability measures $P_1$ and $P_2$ with total variation distance $V$ equals the right hand side of (10).



Proof: In the setting of Theorem II.1, suppose that $F = \{1, 2\}$ and that the two probability measures are $P_1$ and $P_2$ with densities $p_1$ and $p_2$ respectively. Since $2\max(p_1, p_2)$ equals $p_1 + p_2 + |p_1 - p_2|$, it follows that $2\bar r$ equals $1 - V$. Inequality (10) is then a direct consequence of inequality (5).

The following example shows that (10) is sharp. Fix $V \in [0,1]$. Consider the space $\mathcal{X} = \{1, 2\}$ and define the probabilities $P_1$ and $P_2$ by $P_1\{1\} = P_2\{2\} = (1 + V)/2$ and, of course, $P_1\{2\} = P_2\{1\} = (1 - V)/2$. Then the total variation distance between $P_1$ and $P_2$ equals $V$. Also, if we take $Q$ to be the uniform probability measure $Q\{1\} = Q\{2\} = 1/2$, then one sees that $D_f(P_1\|Q) + D_f(P_2\|Q)$ equals $f(1 + V) + f(1 - V)$, which is the same as the right hand side in (10). ■
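The two-point sharpness example in the proof can be checked numerically. The sketch below builds $P_1$, $P_2$ and the uniform $Q$ on $\{1, 2\}$ for several values of $V$ and confirms that $D_f(P_1\|Q) + D_f(P_2\|Q)$ coincides with $f(1+V) + f(1-V)$ for three convex functions. The function names and the chosen values of $V$ are illustrative assumptions made for this sketch.

```python
import math

def f_div(f, p, q):
    # D_f(P||Q) on a finite space with strictly positive q.
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

def kl_f(x):
    return x * math.log(x) if x > 0.0 else 0.0  # x log x, 0 log 0 = 0

def chi2_f(x):
    return x * x - 1.0

def hellinger_f(x):
    return 1.0 - math.sqrt(x)

def sharpness_gap(f, v):
    """Left side of (10) at Q = uniform minus the right side, for the
    two-point example with total variation distance v."""
    p1 = [(1.0 + v) / 2.0, (1.0 - v) / 2.0]
    p2 = [(1.0 - v) / 2.0, (1.0 + v) / 2.0]
    q = [0.5, 0.5]
    return f_div(f, p1, q) + f_div(f, p2, q) - (f(1.0 + v) + f(1.0 - v))

def total_variation(p, q):
    # Half the L1 distance between the two probability vectors.
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

gaps = [
    sharpness_gap(f, v)
    for f in (kl_f, chi2_f, hellinger_f)
    for v in (0.0, 0.3, 0.9, 1.0)
]
max_abs_gap = max(abs(g) for g in gaps)
tv_check = total_variation([0.65, 0.35], [0.35, 0.65])  # v = 0.3
```

All gaps vanish up to rounding, which is exactly the equality case described in the proof.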

What we have actually shown in the above proof is that inequality (10) is sharp for the space $\mathcal{X} = \{1, 2\}$. However, the result holds in more general spaces as well. For example, if the space is such that there exist two disjoint nonempty subsets $A_1$ and $A_2$ and two probability measures $\nu_1$ and $\nu_2$ concentrated on $A_1$ and $A_2$ respectively, then we can define $P_1 := \nu_1 (1 + V)/2 + \nu_2 (1 - V)/2$ and $P_2 := \nu_1 (1 - V)/2 + \nu_2 (1 + V)/2$ so that $V(P_1, P_2) = V$ and (10) becomes an equality (with $Q = \nu_1/2 + \nu_2/2$).

There exist many inequalities in the literature relating the $f$-divergence of two probability measures to their total variation distance. We refer the reader to [15] for the sharpest results in this direction and for earlier references. Inequality (10), which is new, can be trivially converted into an inequality between $D_f(P_1\|P_2)$ and $V$ by taking $Q = P_2$. The resulting inequality will not be sharp, however, and hence will be inferior to the inequalities in [15]. As stated, inequality (10) is a sharp inequality relating not $D_f(P_1\|P_2)$ but a symmetrized form of $f$-divergence between $P_1$ and $P_2$ to their total variation distance.

In the remainder of this section, we shall apply Theorem II.1 and Corollary II.3 to specific $f$-divergences.



Example II.4 (Kullback-Leibler Divergence). Let $f(x) := x \log x$. Then $D_f(P\|Q)$ becomes the Kullback-Leibler divergence $D(P\|Q)$ between $P$ and $Q$. The quantity $\sum_{\theta \in F} D(P_\theta\|Q)$ is minimized when $Q = \bar P := (\sum_{\theta \in F} P_\theta)/N$. This is a consequence of the following identity, which is sometimes referred to as the compensation identity, see for example [12, Page 1603]:

$$ \sum_{\theta \in F} D(P_\theta\|Q) = \sum_{\theta \in F} D(P_\theta\|\bar P) + N D(\bar P\|Q). $$

Using inequality (5) with $Q = \bar P$, we obtain

$$ \frac{1}{N} \sum_{\theta \in F} D(P_\theta\|\bar P) \geq (1 - \bar r) \log\big(N(1 - \bar r)\big) + \bar r \log\left(\frac{N \bar r}{N - 1}\right). $$

The quantity on the left hand side is known as the Jensen-Shannon divergence. It is also Shannon's mutual information [16, Page 19] between the random parameter $\theta$ distributed according to the uniform distribution on $F$ and the observation $X$ whose conditional distribution given $\theta$ equals $P_\theta$. The above inequality is stronger than the version of Fano's inequality commonly used in nonparametric statistics. It is implicit in [17, Proof of Theorem 1] and is explicitly stated in a slightly different form in [18, Theorem 3]. The proof in [18] is based on Fano's inequality from information theory [16, Theorem 2.10.1]. To obtain the usual form of Fano's inequality as used in statistics, we turn to inequality (8). For $a_0 := (N - 1)/(2N - 1) \leq 1 - 1/N$ and the function $g$ in (7), it can be checked that

$$ g(a_0) = \frac{N^2}{2N - 1} \log N + N \log\left(\frac{N}{2N - 1}\right) $$

and $g'(a_0) = -N \log N$. Using inequality (8) with $a = a_0$, we get that

$$ \bar r \geq 1 - \frac{\log\big((2N - 1)/N\big) + \frac{1}{N} \sum_{\theta \in F} D(P_\theta\|\bar P)}{\log N}. $$

Since $\log((2N - 1)/N) \leq \log 2$, we have obtained

$$ r \geq \bar r \geq 1 - \frac{\log 2 + \frac{1}{N} \sum_{\theta \in F} D(P_\theta\|\bar P)}{\log N}, \tag{11} $$

which is the commonly used version of Fano's inequality.
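For a finite sample space, the commonly used form of Fano's inequality can be evaluated directly and compared with the exact testing risk from (3). The following sketch assumes the bound has the form $1 - (\log 2 + \frac{1}{N}\sum_\theta D(P_\theta\|\bar P))/\log N$; the function names and the four-hypothesis example are hypothetical and chosen only for illustration.

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence D(P||Q) on a finite space (0 log 0 = 0).
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0.0)

def fano_lower_bound(dists):
    """1 - (log 2 + (1/N) sum_theta D(P_theta || P_bar)) / log N,
    where P_bar is the uniform mixture of the N measures."""
    n = len(dists)
    p_bar = [sum(col) / n for col in zip(*dists)]
    mutual_info = sum(kl(p, p_bar) for p in dists) / n
    return 1.0 - (math.log(2.0) + mutual_info) / math.log(n)

def bayes_risk_uniform(dists):
    # Exact testing Bayes risk under the uniform prior, expression (3).
    n = len(dists)
    return 1.0 - sum(max(col) for col in zip(*dists)) / n

# Four hypotheses on a four-point space: P_theta puts mass 0.35 on the
# point theta and spreads the remaining 0.65 evenly over the other points.
N = 4
dists = [[0.35 if x == t else 0.65 / 3 for x in range(N)] for t in range(N)]
exact = bayes_risk_uniform(dists)
bound = fano_lower_bound(dists)
```

In this example the exact risk is $0.65$ and the Fano bound lies below it, as it must; the gap reflects the $\log 2$ slack in (11).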
By taking $f(x) = x \log x$ in Corollary II.3, we get that

$$ D(P_1\|\bar P) + D(P_2\|\bar P) \geq (1 + V) \log(1 + V) + (1 - V) \log(1 - V). $$

This inequality, relating the Jensen-Shannon divergence between two probability measures (also known as capacitory discrimination) to their total variation distance, is due to Topsøe [12, Equation (24)]. Our proof is slightly simpler than Topsøe's. Topsøe [12] also explains how to use this inequality to deduce Pinsker's inequality with sharp constant: $D(P_1\|P_2) \geq 2V^2$. Thus, Theorem II.1 can be considered as a generalization of both Fano's inequality and Pinsker's inequality to $f$-divergences.

Example II.5 (Chi-Squared Divergence). Let $f(x) = x^2 - 1$. Then $D_f(P\|Q)$ becomes the chi-squared divergence $\chi^2(P\|Q) := \int p^2/q - 1$. The function $g$ can be easily seen to satisfy

$$ g(a) = \frac{N^3}{N - 1} \left(1 - \frac{1}{N} - a\right)^2. $$

Because $\bar r \leq 1 - 1/N$, we can invert the inequality $\inf_Q \sum_{\theta \in F} \chi^2(P_\theta\|Q) \geq g(\bar r)$ to obtain

$$ r \geq \bar r \geq 1 - \frac{1}{N} - \sqrt{\frac{N - 1}{N^3} \inf_Q \sum_{\theta \in F} \chi^2(P_\theta\|Q)}. \tag{12} $$

Also, it follows from Corollary II.3 that for every two probability measures $P_1$ and $P_2$,

$$ \inf_Q \big(\chi^2(P_1\|Q) + \chi^2(P_2\|Q)\big) \geq 2V^2. \tag{13} $$

The weaker inequality $\chi^2(P_1\|\bar P) + \chi^2(P_2\|\bar P) \geq 2V^2$ can be found in [12, Equation (11)].

Example II.6 (Hellinger Distance). Let $f(x) = 1 - \sqrt{x}$. Then $D_f(P\|Q) = 1 - \int \sqrt{pq}\, d\mu = H^2(P, Q)/2$, where $H^2(P, Q) = \int (\sqrt{p} - \sqrt{q})^2 d\mu$ is the square of the Hellinger distance between $P$ and $Q$. It can be shown, using the Cauchy-Schwarz inequality, that $\sum_{\theta \in F} D_f(P_\theta\|Q)$ is minimized when $Q$ has a density with respect to $\mu$ that is proportional to $(\sum_{\theta \in F} \sqrt{p_\theta})^2$. Indeed, if $u := \sum_{\theta \in F} \sqrt{p_\theta}$, then

$$ \sum_{\theta \in F} D_f(P_\theta\|Q) = N - \int u \sqrt{q}\, d\mu \geq N - \sqrt{\int u^2\, d\mu}, $$

by the Cauchy-Schwarz inequality, with equality when $q$ is proportional to $u^2$. The inequality (5) can then be simplified to

$$ \sqrt{1 - \bar r} + \sqrt{(N - 1)\bar r} \geq \sqrt{\frac{\int u^2\, d\mu}{N}}. \tag{14} $$

We now observe that

$$ \int u^2\, d\mu = N + \sum_{\theta \neq \theta'} \int \sqrt{p_\theta p_{\theta'}}\, d\mu = N^2 - \frac{1}{2} \sum_{\theta \neq \theta'} H^2(P_\theta, P_{\theta'}). $$

We let $h^2 := \sum_{\theta, \theta'} H^2(P_\theta, P_{\theta'})/N^2$ so that $\int u^2 d\mu = N^2(1 - h^2/2)$. As a consequence, we have $\int u^2 d\mu \leq N^2$. Also note that $\int u^2 d\mu \geq \int (\sum_\theta p_\theta)\, d\mu = N$. Therefore, the right hand side of the inequality (14) lies between $1$ and $\sqrt{N}$. On the other hand, it can be checked that, as a function of $\bar r$, the left hand side of (14) is strictly increasing from $1$ at $\bar r = 0$ to $\sqrt{N}$ at $\bar r = 1 - 1/N$. It therefore follows that inequality (14) is equivalent to $\bar r \geq \tilde r$, where $\tilde r \in [0, 1 - 1/N]$ is the solution to the equation obtained by replacing the inequality in (14) with an equality.

This equation can be solved in the usual way by squaring until we get a quadratic equation in $\bar r$, which has two solutions. One of the two solutions can be discarded by continuity considerations (the solution has to be continuous in $\int u^2 d\mu/N$) and the fact that $\bar r \leq 1 - 1/N$. The other solution equals $\tilde r$ and, written in terms of $h^2$, is given by

$$ \tilde r = 1 - \frac{1}{N} - \frac{N - 2}{N} \cdot \frac{h^2}{2} - \frac{\sqrt{N - 1}}{N} \sqrt{h^2 (2 - h^2)}. $$

We have thus shown that

$$ r \geq \bar r \geq 1 - \frac{1}{N} - \frac{N - 2}{2N} h^2 - \frac{\sqrt{N - 1}}{N} \sqrt{h^2 (2 - h^2)}. $$

In the case when $N = 2$ and $F = \{1, 2\}$, it is clear that $h^2 = (H^2(P_1, P_2) + H^2(P_2, P_1))/4 = H^2(P_1, P_2)/2$. Also, since $2\bar r$ equals $1 - V$, where $V$ denotes the total variation distance between $P_1$ and $P_2$, the above inequality implies that for every pair of probability measures $P_1$ and $P_2$, we have

$$ V \leq H(P_1, P_2) \sqrt{1 - \frac{H^2(P_1, P_2)}{4}}. $$

This inequality is usually attributed to Le Cam [19].

Example II.7 (Total Variation Distance). Let $f(x) = |x - 1|/2$. Then $D_f(P\|Q)$ becomes the total variation distance between $P$ and $Q$. The function $g$ satisfies

$$ g(\bar r) = \frac{1}{2} \big|N(1 - \bar r) - 1\big| + \frac{N - 1}{2} \left|\frac{N \bar r}{N - 1} - 1\right|. $$

Since $\bar r \leq 1 - 1/N$, we have $N(1 - \bar r) \geq 1$ and $N\bar r/(N - 1) \leq 1$, so that the above expression for $g(\bar r)$ simplifies to $N - 1 - N\bar r$. Inequality (5), therefore, results in

$$ r \geq \bar r \geq 1 - \frac{1}{N} - \frac{\inf_Q \sum_{\theta \in F} V_\theta}{N}, $$

where $V_\theta$ denotes the total variation distance between $P_\theta$ and $Q$.


Example II.8. Let $f(x) = x^l - 1$ where $l > 1$. The case $l = 2$ has already been considered in Example II.5. The function $g$ has the expression

$$ g(\bar r) = N^l (1 - \bar r)^l - N + (N - 1) \left(\frac{N \bar r}{N - 1}\right)^l. $$

It therefore follows that $\inf_Q \sum_{\theta \in F} D_f(P_\theta\|Q) \geq g(\bar r) \geq N^l (1 - \bar r)^l - N$, which results in the inequality

$$ r \geq \bar r \geq 1 - \left(\frac{1}{N^{l-1}} + \frac{\inf_Q \sum_{\theta \in F} D_f(P_\theta\|Q)}{N^l}\right)^{1/l}. \tag{15} $$

When $l = 2$, inequality (15) results in a bound that is weaker than inequality (12), although for large $N$ the two bounds are almost the same.

Example II.9 (Reverse Kullback-Leibler divergence). Let $f(x) = -\log x$ so that $D_f(P\|Q) = D(Q\|P)$. Then from Corollary II.3, we get that for every two probability measures $P_1$ and $P_2$,

$$ \inf_Q \big\{D(Q\|P_1) + D(Q\|P_2)\big\} \geq \log\left(\frac{1}{1 - V^2}\right), $$

or equivalently,

$$ V \leq \sqrt{1 - \exp\Big(-\inf_Q \big\{D(Q\|P_1) + D(Q\|P_2)\big\}\Big)}. \tag{16} $$

Unlike Example II.4, it is not true that $D(Q\|P_1) + D(Q\|P_2)$ is minimized when $Q = \bar P$. This is easy to see because $D(\bar P\|P_1) + D(\bar P\|P_2)$ is finite only when $P_1 \ll P_2$ and $P_2 \ll P_1$. By taking $Q = P_1$ and $Q = P_2$, we get that

$$ V \leq \sqrt{1 - \exp\big(-\min(D(P_1\|P_2), D(P_2\|P_1))\big)}. $$

The above inequality, which is clearly weaker than inequality (16), can also be found in [20, Proof of Lemma 2.6].



III. Bounds for $J_f$

In order to apply the minimax lower bounds of the previous section in practical situations, we must be able to bound the quantity $J_f := \inf_Q \sum_{\theta \in F} D_f(P_\theta\|Q)/N$ from above. We shall provide such bounds in this section. It should be noted that for some functions $f$, it may be possible to calculate $J_f$ directly. For example, the quantity $\inf_Q \sum_{\theta \in F} H^2(P_\theta, Q)$ can be written in terms of pairwise Hellinger distances (Example II.6) and may be calculated exactly for certain probability measures $P_\theta$. This is not the case for most functions $f$, however.

The following is a simple upper bound for Jf which, in 
the case when f(x) = x log x or Kullback-Leibler divergence, 
has been frequently used in the literature (see for example ifTUl 
and El]). 

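$$ J_f \leq \frac{1}{N} \sum_{\theta \in F} D_f(P_\theta\|\bar P) \leq \frac{1}{N^2} \sum_{\theta, \theta' \in F} D_f(P_\theta\|P_{\theta'}) \leq \max_{\theta, \theta' \in F} D_f(P_\theta\|P_{\theta'}). $$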



We observed in section II that $J_f$ measures the spread of the probability measures $P_\theta, \theta \in F$, i.e., how tightly packed or spread out they are. It should be clear that the simple bound $\max_{\theta, \theta'} D_f(P_\theta\|P_{\theta'})$ does not adequately describe this aspect of $P_\theta, \theta \in F$, and it is therefore desirable to look for alternative upper bounds for $J_f$ that capture the notion of spread in a better way.

In the case of the Kullback-Leibler divergence, Yang and Barron [1, Page 1571] provided such an upper bound for $J_f$. They showed that for any finite set $\{Q_\alpha : \alpha \in G\}$ of probability measures,

$$ \frac{1}{N} \sum_{\theta \in F} D(P_\theta\|\bar P) \leq \log |G| + \max_{\theta \in F} \min_{\alpha \in G} D(P_\theta\|Q_\alpha). \tag{17} $$

Let us now take a closer look at this beautiful inequality of Yang and Barron [1]. The $|G|$ probability measures $Q_\alpha, \alpha \in G$ can be viewed as an approximation of the $N$ probability measures $P_\theta, \theta \in F$. The term $\max_\theta \min_\alpha D(P_\theta\|Q_\alpha)$ then denotes the approximation error, measured via the Kullback-Leibler divergence. The right hand side of inequality (17) can therefore be made small if it is possible to choose not too many probability measures $Q_\alpha$ which approximate the given set of probability measures $P_\theta$ well.

It should be clear how the upper bound (17) measures the spread of the probability measures $P_\theta, \theta \in F$. If the probabilities are tightly packed, it is possible to approximate them well with a smaller set of probabilities and then the bound will be small. On the other hand, if $P_\theta, \theta \in F$ are well spread out, we need more probability measures for approximation and consequently the bound will be large.

Another important aspect of inequality (17) is that it can be used to obtain lower bounds for $R$ depending only on global metric entropy properties of the parameter space $\Theta$ and the space of probability measures $\{P_\theta, \theta \in \Theta\}$ (see section IV). On the other hand, the evaluation of inequalities resulting from the use of $J_f \leq \max_{\theta, \theta'} D(P_\theta\|P_{\theta'})$ requires knowledge of both metric entropy and the existence of certain special localized subsets. We refer the reader to [1] for a detailed discussion of these issues.

The goal of this section is to generalize inequality (17) to $f$-divergences. The main result is given below. In section IV, we shall use this theorem along with the results of the previous section to come up with minimax lower bounds involving global entropy properties.

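Theorem III.1. Let $Q_\alpha, \alpha \in G$ be $M := |G|$ probability measures having densities $q_\alpha, \alpha \in G$ with respect to $\mu$ and let $j : F \to G$ be a mapping from $F$ to $G$. For every convex function $f : [0, \infty) \to \mathbb{R}$, we have

$$ J_f \leq \frac{1}{N} \sum_{\theta \in F} \int_{\mathcal{X}} \frac{q_{j(\theta)}}{M} f\left(\frac{M p_\theta}{q_{j(\theta)}}\right) d\mu + \left(1 - \frac{1}{M}\right) f(0). \tag{18} $$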



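Proof: Let $\bar Q := \sum_{\alpha \in G} Q_\alpha/M$ and $\bar q := \sum_{\alpha \in G} q_\alpha/M$. Clearly for each $\theta \in F$, we have

$$ D_f(P_\theta\|\bar Q) = \int_{\mathcal{X}} \bar q \left[f\left(\frac{p_\theta}{\bar q}\right) - f(0)\right] d\mu + f(0). $$

The convexity of $f$ implies that the map $y \mapsto y[f(a/y) - f(0)]$ is non-increasing for every nonnegative $a$. Using this and the fact that $\bar q \geq q_{j(\theta)}/M$, we get that for every $\theta \in F$,

$$ D_f(P_\theta\|\bar Q) \leq \int_{\mathcal{X}} \frac{q_{j(\theta)}}{M} \left[f\left(\frac{M p_\theta}{q_{j(\theta)}}\right) - f(0)\right] d\mu + f(0). $$

Inequality (18) now follows as a consequence of the inequality $J_f \leq \sum_{\theta \in F} D_f(P_\theta\|\bar Q)/N$. ■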

In the following examples, we shall demonstrate that Theorem III.1 is indeed a generalization of the bound (17) to $f$-divergences. We shall also see that Theorem III.1 results in inequalities that have the same qualitative structure as (17), at least for the convex functions $f$ of interest such as $x^l - 1$, $l > 1$, and $(\sqrt{x} - 1)^2$.

Example III.2 (Kullback-Leibler divergence). Let $f(x) = x \log x$. In this case, $J_f$ equals $\sum_{\theta \in F} D(P_\theta\|\bar P)/N$ and, invoking inequality (18), we get that

$$ \frac{1}{N} \sum_{\theta \in F} D(P_\theta\|\bar P) \leq \log M + \frac{1}{N} \sum_{\theta \in F} D(P_\theta\|Q_{j(\theta)}). $$

Inequality (17) now follows if we choose $j(\theta) := \arg\min_{\alpha \in G} D(P_\theta\|Q_\alpha)$. Hence Theorem III.1 is indeed a generalization of (17).



Example III.3. Let $f(x) = x^l - 1$ for $l > 1$. Applying inequality (18), we get that

$$ J_f \leq M^{l-1} \left(\frac{1}{N} \sum_{\theta \in F} D_f(P_\theta\|Q_{j(\theta)}) + 1\right) - 1. $$

By choosing $j(\theta) = \arg\min_{\alpha \in G} D_f(P_\theta\|Q_\alpha)$, we get that

$$ J_f \leq M^{l-1} \left(\max_{\theta \in F} \min_{\alpha \in G} D_f(P_\theta\|Q_\alpha) + 1\right) - 1. \tag{19} $$

In particular, in the case of the chi-squared divergence, i.e., when $l = 2$, the quantity $J_f = \inf_Q \sum_{\theta \in F} \chi^2(P_\theta\|Q)/N$ is bounded from above by

$$ M \left(\max_{\theta \in F} \min_{\alpha \in G} \chi^2(P_\theta\|Q_\alpha) + 1\right) - 1. \tag{20} $$

Just like (17), each of the above two inequalities is also a function of the number of probability measures $Q_\alpha$ and the approximation error, which is now measured in terms of the chi-squared divergence.
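The chi-squared bound can be illustrated numerically on a finite sample space. Since $J_f$ is an infimum over $Q$, it is at most the average chi-squared divergence to the uniform mixture $\bar Q$ of the approximating measures, and the proof of Theorem III.1 shows that this average is in turn at most $M(\max_\theta \min_\alpha \chi^2(P_\theta\|Q_\alpha) + 1) - 1$. The sketch below checks the second inequality on randomly generated measures; the function names, seed and dimensions are hypothetical choices made for this illustration.

```python
import random

def chi2(p, q):
    # Chi-squared divergence sum_x p(x)^2 / q(x) - 1 (q strictly positive).
    return sum(px * px / qx for px, qx in zip(p, q)) - 1.0

def random_dist(rng, k):
    # Strictly positive random probability vector of length k.
    weights = [rng.random() + 0.05 for _ in range(k)]
    total = sum(weights)
    return [v / total for v in weights]

rng = random.Random(1)
k, n, m = 6, 12, 3  # sample space size, N = |F|, M = |G|
P = [random_dist(rng, k) for _ in range(n)]  # measures P_theta
Q = [random_dist(rng, k) for _ in range(m)]  # approximating measures Q_alpha

# Uniform mixture of the Q_alpha; plugging it in gives an upper bound on J_f.
q_bar = [sum(col) / m for col in zip(*Q)]
avg_to_mixture = sum(chi2(p, q_bar) for p in P) / n

# Approximation error and the right hand side M * (error + 1) - 1.
approx_error = max(min(chi2(p, q) for q in Q) for p in P)
bound = m * (approx_error + 1.0) - 1.0
```

The comparison `avg_to_mixture <= bound` should hold for any choice of measures, and it shows the trade-off between the number $M$ of approximating measures and the approximation error.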

Example III.4 (Hellinger distance). Let $f(x) = (\sqrt{x} - 1)^2$ so that $D_f(P\|Q) = H^2(P, Q)$, the square of the Hellinger distance between $P$ and $Q$. Using inequality (18), we get that

$$ J_f \leq 2 - \frac{1}{\sqrt{M}} \left(2 - \frac{1}{N} \sum_{\theta \in F} H^2(P_\theta, Q_{j(\theta)})\right). $$

If we now choose $j(\theta) := \arg\min_{\alpha \in G} H^2(P_\theta, Q_\alpha)$, then we get

$$ J_f \leq 2 - \frac{1}{\sqrt{M}} \left(2 - \max_{\theta \in F} \min_{\alpha \in G} H^2(P_\theta, Q_\alpha)\right). $$

Notice, once again, the trade-off between $M$ and the approximation error, which is measured in terms of the Hellinger distance.

IV. Bounds involving global entropy 

In this section, we shall apply the results of the previous 
two sections to obtain lower bounds for the minimax risk R 
depending only on global metric entropy properties of the 
parameter space. The theorem is stated below, but we shall 
need to establish some notation first. 

1) For $\eta > 0$, let $N(\eta) \geq 1$ be a real number for which there exists a finite subset $F \subseteq \Theta$ with cardinality $\geq N(\eta)$ satisfying $\rho(\theta, \theta') \geq \eta$ whenever $\theta, \theta' \in F$ and $\theta \neq \theta'$. In other words, $N(\eta)$ is a lower bound on the $\eta$-packing number of the metric space $(\Theta, \rho)$.

2) For a convex function $f : [0, \infty) \to \mathbb{R}$ satisfying $f(1) = 0$, a subset $S \subseteq \Theta$ and a positive real number $\epsilon$, let $M_f(\epsilon, S)$ be a positive real number for which there exists a finite set $G$ with cardinality $\leq M_f(\epsilon, S)$ and probability measures $Q_\alpha, \alpha \in G$ such that $\sup_{\theta \in S} \min_{\alpha \in G} D_f(P_\theta\|Q_\alpha) \leq \epsilon^2$. In other words, $M_f(\epsilon, S)$ is an upper bound on the $\epsilon$-covering number of the space $\{P_\theta : \theta \in S\}$ when distances are measured by the square root of the $f$-divergence. For purposes of clarity, we write $M_{KL}(\epsilon, S)$, $M_C(\epsilon, S)$ and $M_l(\epsilon, S)$ for $M_f(\epsilon, S)$ when the function $f$ equals $x \log x$, $x^2 - 1$ and $x^l - 1$ respectively.

We note here that the probability measures $Q_\alpha, \alpha \in G$ in the definition of $M_f(\epsilon, S)$ do not need to be included in the set $\{P_\theta, \theta \in \Theta\}$, and the set $G$ just denotes the index set and need not have any relation to $S$ or $\Theta$.

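Theorem IV.1. The minimax risk $R$ satisfies the inequality $R \geq \sup_{\eta > 0, \epsilon > 0} \ell(\eta/2)(1 - \star)$, where $\star$ stands for any of the following quantities:

$$ \frac{\log 2 + \log M_{KL}(\epsilon, \Theta) + \epsilon^2}{\log N(\eta)}, \tag{21} $$

$$ \frac{1}{N(\eta)} + \sqrt{\frac{(1 + \epsilon^2) M_C(\epsilon, \Theta)}{N(\eta)}}, \tag{22} $$

and for $l > 1$, $l \neq 2$,

$$ \left(\frac{1}{N(\eta)^{l-1}} + \frac{(1 + \epsilon^2) M_l(\epsilon, \Theta)^{l-1}}{N(\eta)^{l-1}}\right)^{1/l}. \tag{23} $$

In the sequel, by inequality (22), we mean the inequality $R \geq \sup_{\eta > 0, \epsilon > 0} \ell(\eta/2)(1 - \star)$ with $\star$ representing (22), and similarly for inequalities (21) and (23).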
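To see how the bounds of Theorem IV.1 are used once packing and covering numbers are available, the sketch below evaluates the factor $1 - \star$ for the Kullback-Leibler and chi-squared versions. It assumes the forms $(\log 2 + \log M_{KL} + \epsilon^2)/\log N(\eta)$ for (21) and $1/N(\eta) + \sqrt{(1+\epsilon^2) M_C / N(\eta)}$ for (22). The function names and the numerical inputs are purely hypothetical and do not come from any problem treated in the paper.

```python
import math

def star_kl(n_eta, m_kl, eps):
    # Assumed form of (21): (log 2 + log M_KL + eps^2) / log N(eta).
    return (math.log(2.0) + math.log(m_kl) + eps ** 2) / math.log(n_eta)

def star_chi2(n_eta, m_c, eps):
    # Assumed form of (22): 1/N(eta) + sqrt((1 + eps^2) * M_C / N(eta)).
    return 1.0 / n_eta + math.sqrt((1.0 + eps ** 2) * m_c / n_eta)

def risk_lower_bound(loss_at_half_eta, star):
    # l(eta/2) * (1 - star), truncated at 0 because the risk is nonnegative.
    return loss_at_half_eta * max(0.0, 1.0 - star)

# Hypothetical inputs: squared-error-type loss l(t) = t^2 with eta = 0.2,
# a packing bound N(eta) = 1000 and covering bounds M = 10 at eps = 0.5.
eta, eps = 0.2, 0.5
n_eta, m_cover = 1000.0, 10.0
loss = (eta / 2.0) ** 2

s_kl = star_kl(n_eta, m_cover, eps)
s_chi2 = star_chi2(n_eta, m_cover, eps)
bound_kl = risk_lower_bound(loss, s_kl)
bound_chi2 = risk_lower_bound(loss, s_chi2)
```

In practice one would maximize over $\eta$ and $\epsilon$, for example on a grid, rather than evaluate a single pair as done here.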

Proof: We shall give the proof of inequality (22). The remaining two inequalities are proved in a similar manner. Fix $\eta > 0$. By the definition of $N(\eta)$, one can find a finite subset $F \subseteq \Theta$ with cardinality $|F| \geq N(\eta)$ such that $\rho(\theta, \theta') \geq \eta$ for $\theta, \theta' \in F$ and $\theta \neq \theta'$. We then employ the inequality $R \geq \ell(\eta/2)\, r$, where $r$ is defined as in (1). Inequality (12) can now be used to obtain

$$ r \geq 1 - \sqrt{\frac{\inf_Q \sum_{\theta \in F} \chi^2(P_\theta\|Q)}{|F|^2}} - \frac{1}{|F|}. $$

We now fix $\epsilon > 0$ and use the definition of $M_C(\epsilon, F)$ to get a finite set $G$ with cardinality $\leq M_C(\epsilon, F)$ and probability measures $Q_\alpha, \alpha \in G$ such that $\sup_{\theta \in F} \min_{\alpha \in G} \chi^2(P_\theta\|Q_\alpha) \leq \epsilon^2$. We then use inequality (20) to get that

$$ \frac{1}{|F|} \inf_Q \sum_{\theta \in F} \chi^2(P_\theta\|Q) \leq M_C(\epsilon, F)(1 + \epsilon^2) - 1. $$

The proof is complete by the trivial observation $M_C(\epsilon, F) \leq M_C(\epsilon, \Theta)$. ■

The inequality (21) is due to Yang and Barron [1, Proof of Theorem 1]. In their paper, Yang and Barron mainly considered the problem of estimation from $n$ independent and identically distributed observations. However, their method results in inequality (21), which applies to every estimation problem. Inequalities (22) and (23) are new.

Note that the lower bounds for $R$ given in Theorem IV.1 all depend only on the quantities $N(\eta)$ and $M_f(\epsilon, \Theta)$, which describe packing/covering properties of the entire parameter space $\Theta$. Consequently, these inequalities only involve global metric entropy properties. This is made possible by the use of inequalities in Theorem III.1. In applications of Fano's inequality (11) with the standard bound $J_f \le \max_{\theta, \theta' \in F} D(P_\theta \| P_{\theta'})$, as well as in the application of other popular methods for obtaining minimax lower bounds like Le Cam's method or Assouad's lemma, one needs to construct the finite subset $F$ of the parameter space in a very special way: the parameter values in $F$ should be reasonably separated in the metric $\rho$ and also, the probability measures $P_\theta$, $\theta \in F$, should be close in some probability metric. In contrast, the application of Theorem IV.1 does not require the construction of such a special subset $F$.

Yang and Barron [1] have successfully applied inequality (21) to achieve minimax lower bounds of the optimal rate for many nonparametric density estimation and regression problems where $N(\eta)$ and $M_{KL}(\epsilon, \Theta)$ can be deduced from standard results in approximation theory for function classes. We refer the reader to [1] for examples. In some of these examples, inequality (22) can also be applied to get optimal lower bounds. In Section V, we shall employ inequality (22) to obtain a new minimax lower bound in the problem of reconstructing convex bodies from noisy support function measurements.

But prior to that, let us assess the performance of inequality (22) in certain standard parametric estimation problems. In these problems, an interesting contrast arises between the two minimax lower bounds (21) and (22): the inequality (21) only results in a sub-optimal lower bound on the minimax risk (this observation, due to Yang and Barron [1, Page 1574], is also explained in Example IV.2 below) while (22) produces rate-optimal lower bounds.

Our intention here is to demonstrate, with the help of the subsequent three examples, that inequality (22) works even for finite dimensional parametric estimation problems, a scenario in which it is already known [1, Page 1574] that inequality (21) fails. Of course, obtaining optimal minimax rates in such problems is facile in most situations. For example, a two-points argument based on Hellinger distance gives the optimal rate, as is widely recognized since Le Cam [22]. But the point here is that even in finite dimensional situations, global metric entropy features are adequate for obtaining rate-optimal minimax lower bounds. This is contrary to the usual claim that in order to establish rate-optimal lower bounds in parametric settings, one needs more information than global entropy characteristics [1, Page 1574].

In each of the ensuing three examples, we take the parameter space $\Theta$ to be a bounded interval of the real line and we consider the problem of estimating a parameter $\theta \in \Theta$ from $n$ independent observations distributed according to $m_\theta$, where $m_\theta$ is a probability measure on the real line. The probability measure $P_\theta$ accordingly equals the $n$-fold product of $m_\theta$. We shall work with the squared error loss so that $\ell(x) = x^2$, $\rho$ is the Euclidean distance on the real line and $N(\eta)$ can be taken to be $c_1/\eta$ for $\eta \le \eta_0$, where $c_1$ and $\eta_0$ are positive constants depending on the bounded parameter space alone. We shall encounter more positive constants $c, c_2, c_3, c_4, c_5, \epsilon_0$ and $\epsilon_1$ in the examples, all of which depend possibly on the parameter space alone and are thus independent of $n$.

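Example IV.2. Suppose that $m_\theta$ equals the normal distribution with mean $\theta$ and variance 1. It can be readily verified that, for $\theta, \theta' \in \Theta$, one has

$$D(P_\theta \| P_{\theta'}) = \frac{n}{2} |\theta - \theta'|^2 \quad \text{and} \quad \chi^2(P_\theta \| P_{\theta'}) = \exp\left(n |\theta - \theta'|^2\right) - 1.$$

It follows directly that $D(P_\theta \| P_{\theta'}) \le \epsilon^2$ if and only if $|\theta - \theta'| \le \sqrt{2}\, \epsilon / \sqrt{n}$, and $\chi^2(P_\theta \| P_{\theta'}) \le \epsilon^2$ if and only if $|\theta - \theta'| \le \sqrt{\log(1 + \epsilon^2)} / \sqrt{n}$. As a result, we can take

$$M_{KL}(\epsilon, \Theta) = \frac{c_2 \sqrt{n}}{\epsilon} \quad \text{and} \quad M_C(\epsilon, \Theta) = \frac{c_2 \sqrt{n}}{\sqrt{\log(1 + \epsilon^2)}} \tag{24}$$

for $\epsilon \le \epsilon_0$. Now, inequality (21) says that the minimax risk $R_n$ satisfies

$$R_n \ge \sup_{\eta \le \eta_0,\, \epsilon \le \epsilon_0} \frac{\eta^2}{4} \left(1 - \frac{\log 2 + \log(c_2 \sqrt{n}/\epsilon) + \epsilon^2}{\log(c_1/\eta)}\right).$$

The function $\epsilon \mapsto \epsilon^2 - \log \epsilon$ is minimized on $(0, \epsilon_0]$ at, say, $\epsilon = \epsilon_1$ and we then get

$$R_n \ge \sup_{\eta \le \eta_0} \frac{\eta^2}{4} \left(1 - \frac{\log n + c_3}{2 \log c_1 + 2 \log(1/\eta)}\right), \tag{25}$$

where $c_3$ is a function of $c_2$ and $\epsilon_1$. We now note that when $\eta = c/\sqrt{n}$ for a constant $c$, the quantity inside the parentheses on the right hand side of (25) converges to 0 as $n$ goes to $\infty$. This means that inequality (21) only gives lower bounds of inferior order for $R_n$, the optimal order being, of course, $1/n$.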

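On the other hand, we shall show below that inequality (22) gives $R_n \ge c/n$ for a positive constant $c$. Indeed, inequality (22) says that

$$R_n \ge \sup_{\eta \le \eta_0,\, \epsilon \le \epsilon_0} \frac{\eta^2}{4} \left(1 - \frac{\eta}{c_1} - \sqrt{\frac{c_2 (1 + \epsilon^2)\, \eta \sqrt{n}}{c_1 \sqrt{\log(1 + \epsilon^2)}}}\right).$$

Taking $\epsilon = \epsilon_0$ and $\eta = c_3/\sqrt{n}$, we get

$$R_n \ge \frac{c_3^2}{4n} \left(1 - \frac{c_3}{\sqrt{n}\, c_1} - \sqrt{c_4 c_3}\right), \tag{26}$$

where $c_4$ depends only on $c_1$, $c_2$ and $\epsilon_0$. Hence by choosing $c_3$ small, we get that $R_n \ge c/n$ for all large $n$.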
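The contrast between (21) and (22) in this example can be checked numerically. The following is a minimal sketch, not part of the original argument: the values $c_1 = c_2 = 1$, $c_3 = 0.05$ and $\epsilon = 1$ are hypothetical choices (it is assumed that $\epsilon_0 \ge 1$), and the function names are illustrative only. It evaluates the factor $(1 - \star)$ at $\eta = c_3/\sqrt{n}$ for both inequalities.

```python
import math

# Hypothetical constants, chosen only for illustration.
C1, C2 = 1.0, 1.0


def star_kl(n, eta, eps):
    """Quantity (21) with N(eta) = C1/eta and M_KL = C2*sqrt(n)/eps."""
    log_m = math.log(C2 * math.sqrt(n) / eps)
    return (math.log(2) + log_m + eps ** 2) / math.log(C1 / eta)


def star_chi2(n, eta, eps):
    """Quantity (22) with M_C = C2*sqrt(n)/sqrt(log(1 + eps^2))."""
    packing = C1 / eta
    covering = C2 * math.sqrt(n) / math.sqrt(math.log1p(eps ** 2))
    return 1.0 / packing + math.sqrt((1 + eps ** 2) * covering / packing)


def factors(n, c3=0.05, eps=1.0):
    """Factors (1 - star) for (21) and (22) at eta = c3/sqrt(n)."""
    eta = c3 / math.sqrt(n)
    return 1 - star_kl(n, eta, eps), 1 - star_chi2(n, eta, eps)


if __name__ == "__main__":
    for n in (10 ** 3, 10 ** 6, 10 ** 12):
        print(n, factors(n))
```

Under these assumed constants, the factor coming from (21) keeps shrinking as $n$ grows, whereas the factor coming from (22) stays bounded away from zero, which is the behaviour described in the example.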

Example IV.3. Suppose that $\Theta$ is a compact interval of the positive real line that is bounded away from zero and suppose that $m_\theta$ denotes the uniform distribution on $[0, \theta]$. It is then elementary to check that the chi-squared divergence between $P_\theta$ and $P_{\theta'}$ equals $(\theta'/\theta)^n - 1$ if $\theta \le \theta'$ and $\infty$ otherwise. It follows accordingly that $\chi^2(P_\theta \| P_{\theta'}) \le \epsilon^2$ provided

$$0 \le \theta' - \theta \le \frac{\theta \log(1 + \epsilon^2)}{n}. \tag{27}$$

Because the parameter space is a compact interval bounded away from zero, in order to ensure (27), it is enough to require that $0 \le \theta' - \theta \le c_2 \log(1 + \epsilon^2)/n$. Therefore, we can take

$$M_C(\epsilon, \Theta) = \frac{c_3 n}{\log(1 + \epsilon^2)}$$

for $\epsilon \le \epsilon_0$. Inequality (22) now implies that

$$R_n \ge \sup_{\eta \le \eta_0,\, \epsilon \le \epsilon_0} \frac{\eta^2}{4} \left(1 - \frac{\eta}{c_1} - \sqrt{\frac{c_3 (1 + \epsilon^2)\, \eta\, n}{c_1 \log(1 + \epsilon^2)}}\right).$$

Taking $\epsilon = \epsilon_0$ and $\eta = c_4/n$, we get that

$$R_n \ge \frac{c_4^2}{4 n^2} \left(1 - \frac{c_4}{n c_1} - \sqrt{c_4 c_5}\right),$$

where $c_5$ depends only on $c_1$, $c_3$ and $\epsilon_0$. Hence by choosing $c_4$ sufficiently small, we get that $R_n \ge c/n^2$ for all large $n$. This is the optimal minimax rate for this problem as can be seen by estimating $\theta$ by the maximum of the observations.
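The sufficiency of a condition of the form $0 \le \theta' - \theta \le \theta \log(1 + \epsilon^2)/n$ can be sanity-checked with a few lines of code. This is a hedged sketch: the function names are hypothetical, and the closed form $(\theta'/\theta)^n - 1$ is the one stated in the example.

```python
import math


def chi2_uniform_scale(theta, theta2, n):
    """chi^2(P_theta || P_theta') for n-fold products of Uniform[0, theta]."""
    if theta > theta2:
        return math.inf  # P_theta is not dominated by P_theta'
    return (theta2 / theta) ** n - 1


def max_gap(theta, eps, n):
    """Largest gap theta' - theta allowed by the sufficient condition."""
    return theta * math.log1p(eps ** 2) / n


if __name__ == "__main__":
    theta, eps, n = 1.5, 0.5, 100
    print(chi2_uniform_scale(theta, theta + max_gap(theta, eps, n), n), eps ** 2)
```

The check works because $(1 + x/n)^n \le e^x$ with $x = \log(1 + \epsilon^2)$, so the divergence at the largest allowed gap never exceeds $\epsilon^2$.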

Example IV.4. Suppose that $m_\theta$ denotes the uniform distribution on the interval $[\theta, \theta + 1]$. We shall argue that $M_C(\epsilon, \Theta)$ can be chosen to be

$$M_C(\epsilon, \Theta) = \frac{c_2}{(1 + \epsilon^2)^{1/n} - 1} \tag{28}$$

for a positive constant $c_2$, at least for large $n$. To see this, let us define $\epsilon'$ so that $2\epsilon' := (1 + \epsilon^2)^{1/n} - 1$ and let $G$ denote an $\epsilon'$-grid of points in the interval $\Theta$; $G$ would contain at most $c_2/\epsilon'$ points when $\epsilon \le \epsilon_0$. For a point $\alpha$ in the grid, let $Q_\alpha$ denote the $n$-fold product of the uniform distribution on the interval $[\alpha, \alpha + 1 + 2\epsilon']$. Now, for a fixed $\theta \in \Theta$, let $\alpha$ denote the point in the grid such that $\alpha \le \theta \le \alpha + \epsilon'$. It can then be checked that the chi-squared divergence between $P_\theta$ and $Q_\alpha$ is equal to $(1 + 2\epsilon')^n - 1 = \epsilon^2$. Hence $M_C(\epsilon, \Theta)$ can be taken to be the number of probability measures $Q_\alpha$, which is the same as the number of points in $G$. We thus have (28). It can be checked by elementary calculus (Taylor expansion, for example) that the inequality

$$(1 + \epsilon^2)^{1/n} - 1 \ge \frac{\epsilon^2}{n} - \frac{1}{2n}\left(1 - \frac{1}{n}\right)\epsilon^4$$

holds for $\epsilon \le \sqrt{2}$ (in fact for all $\epsilon$, but for $\epsilon > \sqrt{2}$, the right hand side above may be negative). Therefore for $\epsilon \le \min(\epsilon_0, \sqrt{2})$, we get that

$$M_C(\epsilon, \Theta) \le \frac{2 n c_2}{2\epsilon^2 - (1 - 1/n)\epsilon^4}.$$

From inequality (22), we get that for every $\eta \le \eta_0$ and $\epsilon \le \min(\epsilon_0, \sqrt{2})$,

$$R_n \ge \frac{\eta^2}{4}\left(1 - \frac{\eta}{c_1} - \sqrt{\frac{2 (1 + \epsilon^2)\, c_2\, n \eta}{c_1 \left(2\epsilon^2 - (1 - 1/n)\epsilon^4\right)}}\right).$$

If we now take $\epsilon = \min(\epsilon_0, 1)$ and $\eta = c_3/n$, we see that the quantity inside the parentheses converges to $1 - \sqrt{c_3 c_4}$, where $c_4$ depends only on $c_1$, $c_2$ and $\epsilon_0$. Therefore by choosing $c_3$ sufficiently small, we get that $R_n \ge c/n^2$. This is the optimal minimax rate for this problem as can be seen by estimating $\theta$ by the minimum of the observations.
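Two facts used in this example lend themselves to a quick numerical check: the identity $(1 + 2\epsilon')^n - 1 = \epsilon^2$ for the grid half-width $\epsilon'$, and the Taylor-type lower bound on $(1 + \epsilon^2)^{1/n} - 1$. The sketch below uses hypothetical function names and assumes the second-order form of the bound, $\epsilon^2/n - (1 - 1/n)\epsilon^4/(2n)$.

```python
import math


def grid_halfwidth(eps, n):
    """eps' defined by 2*eps' = (1 + eps^2)^(1/n) - 1."""
    return math.expm1(math.log1p(eps ** 2) / n) / 2


def chi2_to_grid_measure(eps_prime, n):
    """chi^2(P_theta || Q_alpha) = (1 + 2*eps')^n - 1 for alpha <= theta <= alpha + eps'."""
    return (1 + 2 * eps_prime) ** n - 1


def taylor_lower(eps, n):
    """Assumed second-order lower bound for (1 + eps^2)^(1/n) - 1."""
    return eps ** 2 / n - (1 - 1.0 / n) * eps ** 4 / (2 * n)


if __name__ == "__main__":
    eps, n = 0.8, 50
    ep = grid_halfwidth(eps, n)
    print(chi2_to_grid_measure(ep, n), eps ** 2, 2 * ep, taylor_lower(eps, n))
```

The bound follows from Taylor's theorem because the third derivative of $x \mapsto (1 + x)^{1/n}$ is nonnegative for $x \ge 0$ and $n \ge 1$.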

The fact that inequality (22) produced optimal lower bounds for the minimax risk in each of the above three examples is reassuring but not really exciting because, as we mentioned before, there are other simpler methods of obtaining such bounds in these examples. We presented them as simple toy examples to evaluate the performance of (22), to present a difference between (21) and (22) (which provides a justification for using divergences other than the Kullback-Leibler divergence for lower bounds) and also to stress the fact that global packing and covering characteristics are enough to obtain optimal minimax lower bounds. In order to convince the reader of the effectiveness of (22) in more involved situations, we now apply it to obtain the optimal minimax rate in a $d$-dimensional normal mean estimation problem. We are grateful to an anonymous referee for communicating this example to us. Another non-trivial application of (22) is presented in the next section.

Example IV.5. Let $\Theta$ denote the ball in $\mathbb{R}^d$ of radius $\Gamma$ centered at the origin. Let us consider the problem of estimating $\theta \in \Theta$ from an observation $X$ distributed according to the normal distribution with mean $\theta$ and variance-covariance matrix $\sigma^2 I_d$, where $I_d$ denotes the identity matrix of order $d$. Thus $P_\theta$ denotes the $N(\theta, \sigma^2 I_d)$ distribution. We assume squared error loss so that $\ell(x) = x^2$ and $\rho$ is the Euclidean distance on $\mathbb{R}^d$.

We shall use inequality (22) to show that the minimax risk $R$ for this problem is larger than or equal to a constant multiple of $d\sigma^2$ when $\Gamma \ge \sigma\sqrt{d}$. We begin by arguing that we can take

$$N(\eta) = \left(\frac{\Gamma}{\eta}\right)^d \quad \text{and} \quad M_C(\epsilon, \Theta) = \left(\frac{3\Gamma}{\sigma\sqrt{\log(1 + \epsilon^2)}}\right)^d \tag{29}$$

whenever $\sigma\sqrt{\log(1 + \epsilon^2)} \le \Gamma$.

For $N(\eta)$, we first note that the $\eta$-packing number of the metric space $(\Theta, \rho)$ is bounded from below by its $\eta$-covering number. Now, for any $\eta$-covering set, the space $\Theta$ is contained in the union of the balls of radius $\eta$ with centers in the covering set and hence the volume of $\Theta$ must be smaller than the sum of the volumes of these balls. Therefore, the number of points in the $\eta$-covering set must be at least $(\Gamma/\eta)^d$. Since this is true for every $\eta$-covering set, it follows that the $\eta$-covering number and hence the $\eta$-packing number is not smaller than $(\Gamma/\eta)^d$.

For $M_C(\epsilon, \Theta)$, we first observe that for $\theta, \theta' \in \Theta$, the chi-squared divergence between $P_\theta$ and $P_{\theta'}$ can be easily computed (because they are normal distributions with the same covariance matrix) to be $\chi^2(P_\theta \| P_{\theta'}) = \exp\left(\rho^2(\theta, \theta')/\sigma^2\right) - 1$. Therefore $\chi^2(P_\theta \| P_{\theta'}) \le \epsilon^2$ if and only if $\rho(\theta, \theta') \le \epsilon' := \sigma\sqrt{\log(1 + \epsilon^2)}$. As a result, $M_C(\epsilon, \Theta)$ can be taken to be any upper bound on the $\epsilon'$-covering number of $(\Theta, \rho)$. The $\epsilon'$-covering number, as noted previously, is bounded from above by the $\epsilon'$-packing number. Now, for any $\epsilon'$-packing set, the balls of radius $\epsilon'/2$ with centers in the packing set are all disjoint and their union is contained in the ball of radius $\Gamma + (\epsilon'/2)$ centered at the origin. Consequently, the sum of the volumes of these balls is smaller than the volume of the ball of radius $\Gamma + (\epsilon'/2)$ centered at the origin. Therefore, the number of points in the $\epsilon'$-packing set is at most $(1 + (2\Gamma/\epsilon'))^d \le (3\Gamma/\epsilon')^d$ provided $\epsilon' \le \Gamma$. Since this is true for every $\epsilon'$-packing set, it follows that the $\epsilon'$-packing number and hence the $\epsilon'$-covering number is not larger than $(3\Gamma/\epsilon')^d$.

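We can thus apply inequality (22) with (29) to get that, for every $\eta > 0$ and $\epsilon > 0$ such that $\sigma\sqrt{\log(1 + \epsilon^2)} \le \Gamma$, we have

$$R \ge \frac{\eta^2}{4}\left(1 - \left(\frac{\eta}{\Gamma}\right)^d - \left(\frac{3\eta}{\sigma}\right)^{d/2} \frac{\sqrt{1 + \epsilon^2}}{(\log(1 + \epsilon^2))^{d/4}}\right).$$

Now by elementary calculus, it can be checked that the function $\epsilon \mapsto \sqrt{1 + \epsilon^2}/(\log(1 + \epsilon^2))^{d/4}$ is minimized (subject to $\sigma\sqrt{\log(1 + \epsilon^2)} \le \Gamma$) when $1 + \epsilon^2 = e^{d/2}$. We then get that

$$R \ge \frac{\eta^2}{4}\left(1 - \left(\frac{\eta}{\Gamma}\right)^d - \left(\frac{18 e\, \eta^2}{\sigma^2 d}\right)^{d/4}\right).$$

We now take $\eta = c_1 \sigma \sqrt{d}$ and, since $\Gamma \ge \sigma\sqrt{d}$, we obtain

$$R \ge \frac{c_1^2 \sigma^2 d}{4}\left(1 - c_1^d - (18 e c_1^2)^{d/4}\right).$$

We can therefore choose $c_1$ small enough (independent of $d$) to obtain that $R \ge c\, d \sigma^2$. Note that, up to constants, this lower bound is optimal for $R$ because $E_\theta\, \rho^2(X, \theta) = d\sigma^2$.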
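The calculus step in this example can be verified numerically. The sketch below is illustrative only (the function names and the value $c_1 = 0.1$ are hypothetical): writing $t = \log(1 + \epsilon^2)$, it checks that $e^{t/2}/t^{d/4}$ is smallest at $t = d/2$, and that the resulting factor $1 - c_1^d - (18 e c_1^2)^{d/4}$ is positive for a small $c_1$ across a range of dimensions.

```python
import math


def g(t, d):
    """sqrt(1 + eps^2) / (log(1 + eps^2))^(d/4) in terms of t = log(1 + eps^2)."""
    return math.exp(t / 2) / t ** (d / 4)


def grid_argmin(d, step=0.001, t_max=40.0):
    """Brute-force minimizer of g(., d) over a grid of t values."""
    ts = [step * i for i in range(1, int(t_max / step) + 1)]
    return min(ts, key=lambda t: g(t, d))


def final_factor(c1, d):
    """1 - c1^d - (18 e c1^2)^(d/4), obtained with eta = c1 * sigma * sqrt(d)."""
    return 1 - c1 ** d - (18 * math.e * c1 ** 2) ** (d / 4)


if __name__ == "__main__":
    for d in (1, 2, 5, 20):
        print(d, grid_argmin(d), d / 2, final_factor(0.1, d))
```

Setting the derivative of $t/2 - (d/4)\log t$ to zero gives $t = d/2$, and $g(d/2, d) = (2e/d)^{d/4}$, which is where the constant $18e$ in the display comes from.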

V. Reconstruction of Convex Bodies from Noisy Support Function Measurements

In this section, we shall present a novel application of the global minimax lower bound (22). Let $d \ge 2$ and let $K$ be a convex body in $\mathbb{R}^d$, i.e., $K$ is compact, convex and has a non-empty interior. The support function of $K$, $h_K : S^{d-1} \to \mathbb{R}$, is defined by

$$h_K(u) := \sup\{\langle x, u \rangle : x \in K\} \quad \text{for } u \in S^{d-1},$$

where $S^{d-1} := \{x \in \mathbb{R}^d : \sum_i x_i^2 = 1\}$ is the unit sphere. We direct the reader to [23, Section 1.7] or [24, Section 13] for basic properties of support functions. An important property is that the support function uniquely determines the convex body, i.e., $h_K = h_L$ if and only if $K = L$.

Let $\{u_i, i \ge 1\}$ be a sequence of $d$-dimensional unit vectors. Gardner, Kiderlen and Milanfar [25] (see their paper for earlier references) considered the problem of reconstructing an unknown convex body $K$ from noisy measurements of $h_K$ in the directions $u_1, \dots, u_n$. More precisely, their problem was to estimate $K$ from observations $Y_1, \dots, Y_n$ drawn according to the model $Y_i = h_K(u_i) + \xi_i$, $i = 1, \dots, n$, where $\xi_1, \dots, \xi_n$ are independent and identically distributed mean zero Gaussian random variables. They constructed a convex body (estimator) $\hat{K}_n = \hat{K}_n(Y_1, \dots, Y_n)$ having the property that, for nice sequences $\{u_i, i \ge 1\}$, the $L^2$ norm $\|h_K - h_{\hat{K}_n}\|_2$ (see (30) below) converges to zero at the rate $n^{-2/(d+3)}$ for dimensions $d = 2, 3, 4$ and at a slower rate for dimensions $d \ge 5$ (see [25, Theorem 6.2]).

We shall show here that in the same setting, it is impossible in the minimax sense to construct estimators for $K$ converging at a rate faster than $n^{-2/(d+3)}$. This implies that the least squares estimator in [25] is rate optimal for dimensions $d = 2, 3, 4$. We shall need some notation to describe our result.

Let $\mathcal{K}^d$ denote the set of all convex bodies in $\mathbb{R}^d$ and for $\Gamma > 0$, let $\mathcal{K}^d(\Gamma)$ denote the set of all convex bodies in $\mathbb{R}^d$ that are contained in the closed ball of radius $\Gamma$ centered at the origin, so that $\mathcal{K}^d(1)$ denotes the set of all convex bodies contained in the unit ball, which we shall denote by $B$. Note that estimating $K$ is equivalent to estimating the function $h_K$ because the support function uniquely determines the convex body. Thus we shall focus on the problem of estimating $h_K$.

An estimator for $h_K$ is allowed to be a bounded function on $S^{d-1}$ that depends on the data $Y_1, \dots, Y_n$. The loss functions that we shall use are the $L^p$ norms for $p \in [1, \infty]$ defined by

$$\|h_K - \hat{h}\|_p := \left(\int_{S^{d-1}} |h_K(u) - \hat{h}(u)|^p \, du\right)^{1/p} \tag{30}$$

for $p \in [1, \infty)$ and $\|h_K - \hat{h}\|_\infty := \sup_{u \in S^{d-1}} |h_K(u) - \hat{h}(u)|$. For convex bodies $K$ and $L$ and $p \in [1, \infty]$, we shall also write $\delta_p(K, L)$ for $\|h_K - h_L\|_p$ and refer to $\delta_p$ as the $L^p$ distance between $K$ and $L$.

We shall consider the minimax risk of the problem of estimating $h_K$ from $Y_1, \dots, Y_n$ when $K$ is assumed to belong to $\mathcal{K}^d(\Gamma)$, i.e., we are interested in the quantity

$$r_n(p, \Gamma) := \inf_{\hat{h}} \sup_{K \in \mathcal{K}^d(\Gamma)} E_K \|h_K - \hat{h}(Y_1, \dots, Y_n)\|_p.$$

The following is the main theorem of this section.

Theorem V.1. Fix $p \in [1, \infty)$ and $\Gamma > 0$. Suppose the errors $\xi_1, \dots, \xi_n$ are independent normal random variables with mean zero and variance $\sigma^2$. Then the minimax risk $r_n(p, \Gamma)$ satisfies

$$r_n(p, \Gamma) \ge c\, \sigma^{4/(d+3)}\, \Gamma^{(d-1)/(d+3)}\, n^{-2/(d+3)} \tag{31}$$

for a constant $c$ that is independent of $n$.

Remark V.1. In the case when $p = 2$, Gardner, Kiderlen and Milanfar [25] showed that the least squares estimator converges at the rate given by the right hand side of (31) for dimensions $d = 2, 3, 4$. Thus, at least for $p = 2$, the lower bound given by (31) is optimal for dimensions $d = 2, 3, 4$.

We shall use inequality (22) to prove (31). First, let us put the support function estimation problem in the general estimation setting of the last section. Let $\Theta := \{h_K : K \in \mathcal{K}^d(\Gamma)\}$ and let $\mathcal{A}$ be the collection of all bounded functions on the unit sphere $S^{d-1}$. The metric $\rho$ on $\mathcal{A}$ is just the $L^p$ norm and $\ell(x) = x$.

Finally, let $\mathcal{X} = \mathbb{R}^n$ and for $f \in \Theta$, let $P_f$ be the $n$-variate normal distribution with mean vector $(f(u_1), \dots, f(u_n))$ and variance-covariance matrix $\sigma^2 I_n$, where $I_n$ is the identity matrix of order $n$.

In order to apply inequality (22), we need to determine $N(\eta)$ and $M_C(\epsilon, \Theta)$. The quantity $N(\eta)$ is a lower bound on the $\eta$-packing number of the set $\mathcal{K}^d(\Gamma)$ under the $L^p$ norm. When $p = \infty$, Bronshtein [26, Theorem 4 and Remark 1] proved that there exist positive constants $c'$ and $\eta_0$ depending only on $d$ such that $\exp\left(c' (\eta/\Gamma)^{(1-d)/2}\right)$ is a lower bound for the $\eta$-packing number of $\Theta$ for $\eta \le \eta_0$. It is a standard fact that $p = \infty$ corresponds to the Hausdorff metric on $\mathcal{K}^d(\Gamma)$.

It turns out that Bronshtein's result is actually true for every $p \in [1, \infty]$ and not just for $p = \infty$. However, to the best of our knowledge, this has not been proved anywhere in the literature. By modifying Bronshtein's proof appropriately and using the Varshamov-Gilbert lemma (see for example [27, Lemma 4.7]), we provide, in Theorem VII.1, a proof of this fact. Therefore from Theorem VII.1, we can take

$$\log N(\eta) = c' \left(\frac{\Gamma}{\eta}\right)^{(d-1)/2} \quad \text{for } \eta \le \eta_0, \tag{32}$$

where $c'$ and $\eta_0$ are constants depending only on $d$ and $p$.

Now let us turn to $M_C(\epsilon, \Theta)$. For $f, g \in \Theta$, $P_f$ and $P_g$ are normal distributions with the same covariance matrix and hence the chi-squared divergence between $P_f$ and $P_g$ can be seen to be

$$\chi^2(P_f \| P_g) = \exp\left(\frac{1}{\sigma^2} \sum_{i=1}^n (f(u_i) - g(u_i))^2\right) - 1 \le \exp\left(\frac{n \|f - g\|_\infty^2}{\sigma^2}\right) - 1.$$

It follows that

$$\|f - g\|_\infty \le \epsilon' \implies \chi^2(P_f \| P_g) \le \epsilon^2, \tag{33}$$

where $\epsilon' := \sigma\sqrt{\log(1 + \epsilon^2)}/\sqrt{n}$. Let $W_{\epsilon'}$ be the smallest $W$ for which there exist sets $K_1, \dots, K_W$ in $\mathcal{K}^d(\Gamma)$ having the property that for every set $K \in \mathcal{K}^d(\Gamma)$, there exists a $K_j$ such that the Hausdorff distance between $K$ and $K_j$ is less than or equal to $\epsilon'$. It must be clear from (33) that $M_C(\epsilon, \Theta)$ can be taken to be a number larger than $W_{\epsilon'}$. Bronshtein [26, Theorem 3 and Remark 1] showed that there exist positive constants $c''$ and $\epsilon_0$ depending only on $d$ such that

$$\log W_{\epsilon'} \le c'' \left(\frac{\Gamma}{\epsilon'}\right)^{(d-1)/2} \quad \text{for } \epsilon' \le \epsilon_0.$$

Hence for all $\epsilon$ such that $\log(1 + \epsilon^2) \le n\epsilon_0^2/\sigma^2$, we can take

$$\log M_C(\epsilon, \Theta) = c'' \left(\frac{\Gamma \sqrt{n}}{\sigma \sqrt{\log(1 + \epsilon^2)}}\right)^{(d-1)/2}. \tag{34}$$

We are now ready to prove inequality (31). We shall define two quantities

$$\eta(n) := c\, \sigma^{4/(d+3)}\, \Gamma^{(d-1)/(d+3)}\, n^{-2/(d+3)}$$

and

$$u(n) := \left(\frac{\Gamma \sqrt{n}}{\sigma}\right)^{(d-1)/(d+3)},$$

where $c = c(d, p)$ will be specified shortly. Also let $\epsilon(n)$ be such that $\log(1 + \epsilon^2(n)) = u^2(n)$. Clearly as $n \to \infty$, we have $\eta(n) \to 0$, $u(n) \to \infty$ and $u(n)/\sqrt{n} \to 0$. As a result $\eta(n) \le \eta_0$ and $u^2(n) \le n\epsilon_0^2/\sigma^2$ for large $n$ and therefore from (32) and (34), we get that

$$\log N(\eta(n)) = c' \left(\frac{\Gamma}{\eta(n)}\right)^{(d-1)/2} = \frac{c'}{c^{(d-1)/2}}\, u^2(n)$$

and

$$\log M_C(\epsilon(n), \Theta) = c'' \left(\frac{\Gamma \sqrt{n}}{\sigma u(n)}\right)^{(d-1)/2} = c''\, u^2(n).$$

We now apply inequality (22) (recall that $\ell(x) = x$) to obtain that $r_n(p, \Gamma)$ is bounded from below by

$$\frac{\eta(n)}{2}\left[1 - \frac{1}{N(\eta(n))} - \exp\left(\frac{u^2(n)}{2}\left(1 + c'' - \frac{c'}{c^{(d-1)/2}}\right)\right)\right]$$

for all large $n$. If we now choose $c$ so that $c^{(d-1)/2} = c'/(2 + 2c'')$, we get that

$$r_n(p, \Gamma) \ge \frac{\eta(n)}{2}\left[1 - \frac{1}{N(\eta(n))} - \exp\left(-\frac{(1 + c'')\, u^2(n)}{2}\right)\right].$$

Now observe that as $n \to \infty$, the quantity $\eta(n)$ goes to 0 and hence $N(\eta(n))$ goes to $\infty$. Further, as we have already noted, $u(n)$ goes to $\infty$. It follows hence that $r_n(p, \Gamma) \ge \eta(n)/4$ for all large $n$. By choosing $c$ even smaller, we can make inequality (31) true for all $n$.
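The exponent bookkeeping in this proof is easy to get wrong, so a numerical cross-check may help. The sketch below rests on an assumption: it takes $u(n) = (\Gamma\sqrt{n}/\sigma)^{(d-1)/(d+3)}$, the form under which the packing bound (32) at $\eta(n)$ and the covering bound (34) at $\epsilon(n)$ both become multiples of $u^2(n)$. All function names and constants are hypothetical.

```python
import math


def eta_n(n, c, sigma, gamma, d):
    """eta(n) = c * sigma^(4/(d+3)) * gamma^((d-1)/(d+3)) * n^(-2/(d+3))."""
    return c * sigma ** (4 / (d + 3)) * gamma ** ((d - 1) / (d + 3)) * n ** (-2 / (d + 3))


def u_n(n, sigma, gamma, d):
    """Assumed form of u(n): (gamma * sqrt(n) / sigma)^((d-1)/(d+3))."""
    return (gamma * math.sqrt(n) / sigma) ** ((d - 1) / (d + 3))


def log_packing(n, c, c_prime, sigma, gamma, d):
    """(32) evaluated at eta(n): c' * (gamma / eta(n))^((d-1)/2)."""
    return c_prime * (gamma / eta_n(n, c, sigma, gamma, d)) ** ((d - 1) / 2)


def log_covering(n, c_dprime, sigma, gamma, d):
    """(34) evaluated at eps(n), where log(1 + eps(n)^2) = u(n)^2."""
    u = u_n(n, sigma, gamma, d)
    return c_dprime * (gamma * math.sqrt(n) / (sigma * u)) ** ((d - 1) / 2)


if __name__ == "__main__":
    n, c, cp, cpp, sigma, gamma, d = 10 ** 4, 0.3, 1.0, 2.0, 0.5, 3.0, 3
    u2 = u_n(n, sigma, gamma, d) ** 2
    print(log_packing(n, c, cp, sigma, gamma, d), cp / c ** ((d - 1) / 2) * u2)
    print(log_covering(n, cpp, sigma, gamma, d), cpp * u2)
```

With this choice of $u(n)$, the only place the free constant $c$ enters is the factor $c^{-(d-1)/2}$ in the packing term, which is what makes the final tuning of $c$ possible.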
VI. A Covariance Matrix Estimation Example

In the previous section, we have used the global minimax lower bound (22). However, in some situations, the global entropy numbers might be difficult to bound. In such cases, inequalities (21) and (22) are, of course, not applicable and we are unaware of the use of inequality (17) in conjunction with Fano's inequality (11) in the literature. The standard examples use (11) with the bound $J_f \le \max_{\theta, \theta' \in F} D(P_\theta \| P_{\theta'})$ while the examples in [1] all deal with the case when global entropies are available. In this section, we shall demonstrate how a recent minimax lower bound due to Cai, Zhang and Zhou [13] can also be proved using inequalities (11) and (17).

Cai, Zhang and Zhou [13] considered $n$ independent $p \times 1$ random vectors $X_1, \dots, X_n$ distributed according to $N_p(0, \Sigma)$. Suppose that the entries of the $p \times p$ covariance matrix $\Sigma = (\sigma_{ij})$ decay at a certain rate as we move away from the diagonal. Specifically, let us suppose that for a fixed positive constant $\alpha > 0$, the entries $\sigma_{ij}$ of $\Sigma$ satisfy the inequality $\sigma_{ij} \le |i - j|^{-\alpha - 1}$ for $i \ne j$. Cai, Zhang and Zhou [13] showed that when $p$ is large compared to $n$, it is impossible to estimate $\Sigma$ from $X_1, \dots, X_n$ in the spectral norm at a rate faster than $n^{-\alpha/(2\alpha+1)}$. More precisely, they showed that when $p \ge C n^{1/(2\alpha+1)}$,

$$R_n(\alpha) := \inf_{\hat{\Sigma}} \sup_{\Sigma \in \Theta} E_\Sigma \|\hat{\Sigma} - \Sigma\| \ge c\, n^{-\alpha/(2\alpha+1)}, \tag{35}$$

where $c$ and $C$ denote positive constants depending only on $\alpha$. Here $\Theta$ denotes the collection of all covariance matrices $\Sigma = (\sigma_{ij})$ satisfying $\sigma_{ij} \le |i - j|^{-\alpha - 1}$ for $i \ne j$ and the norm $\|\cdot\|$ is the spectral norm (largest eigenvalue).

Cai, Zhang and Zhou [13] used Assouad's lemma for the proof of the inequality (35). We shall use inequalities (11) and (17). Moreover, the choice of the finite subset $F$ that we use is different from the one used in [13, Equation (17)]. This makes our approach different from the general method, due to Yu [28], of replacing Assouad's lemma by Fano's inequality.

Throughout, $A$ denotes a constant that depends on $\alpha$ alone. The value of the constant might vary from place to place.

Consider the matrix $A = (a_{ij})$ with $a_{ij} = 1$ for $i = j$ and $a_{ij} = 1/(A |i - j|^{\alpha+1})$ for $i \ne j$. For $A$ sufficiently large (depending on $\alpha$ alone), $A$ is positive definite and belongs to $\Theta$. Let us fix a positive integer $k \le p/2$ and partition $A$ as

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{12}^T & A_{22} \end{pmatrix},$$

where $A_{11}$ is $k \times k$ and $A_{22}$ is $(p - k) \times (p - k)$. For each $\tau \in \mathbb{R}^k$, we define the matrix

$$A(\tau) := \begin{pmatrix} A_{11} & A_{12}(\tau) \\ (A_{12}(\tau))^T & A_{22} \end{pmatrix},$$

where $A_{12}(\tau)$ is the $k \times (p - k)$ matrix obtained by premultiplying $A_{12}$ with the $k \times k$ diagonal matrix with diagonal entries $\tau_1, \dots, \tau_k$. Clearly, $A(\tau) \in \Theta$ for all $\tau \in \{0, 1\}^k$. We shall need the following two lemmas in order to prove inequality (35).



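Lemma VI.1. For $\tau, \tau' \in \{0, 1\}^k$, $\tau \ne \tau'$, we have

$$\|A(\tau) - A(\tau')\| \ge \frac{1}{A k^\alpha} \sqrt{\frac{\Upsilon(\tau, \tau')}{k}}, \tag{36}$$

where $\Upsilon(\tau, \tau') := \sum_{r=1}^k \mathbf{1}\{\tau_r \ne \tau'_r\}$ denotes the Hamming distance between $\tau$ and $\tau'$.

Proof: Fix $\tau, \tau' \in \{0, 1\}^k$ with $\tau \ne \tau'$. Let $v$ denote the $p \times 1$ vector $(0_k, 1_k, 0_{p-2k})^T$, where $0_k$ denotes the $k \times 1$ vector of zeros etc. Clearly $\|v\|^2 = k$ and $(A(\tau) - A(\tau'))v$ will be a vector of the form $(u, 0)^T$ for some $k \times 1$ vector $u = (u_1, \dots, u_k)^T$. Moreover $u_r = \sum_{s=1}^k (\tau_r - \tau'_r)\, a_{r, k+s}$ and hence

$$|u_r| = \mathbf{1}\{\tau_r \ne \tau'_r\} \sum_{s=1}^k \frac{1}{A |r - k - s|^{\alpha+1}} \ge \frac{\mathbf{1}\{\tau_r \ne \tau'_r\}}{A} \sum_{i=k}^{2k-1} \frac{1}{i^{\alpha+1}} \ge \frac{\mathbf{1}\{\tau_r \ne \tau'_r\}}{A k^\alpha}.$$

Therefore,

$$\|(A(\tau) - A(\tau'))v\|^2 \ge \sum_{r=1}^k u_r^2 \ge \frac{\Upsilon(\tau, \tau')}{A^2 k^{2\alpha}}.$$

The proof is complete because $\|v\|^2 = k$. ∎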
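The argument of Lemma VI.1 can be traced numerically with a small script. It is a sketch with hypothetical names and 0-based indices: it evaluates the vector $u$ from the proof, the ratio $\|(A(\tau) - A(\tau'))v\|/\|v\|$ (a lower bound on the spectral norm), and compares it with one explicit version of the bound, $\sqrt{\Upsilon/k}\; 2^{-(\alpha+1)}/(A k^\alpha)$, in which the factor $2^{-(\alpha+1)}$ is an assumed way of making the varying constant concrete. Only $p \ge 2k$ is needed, so $p$ does not appear.

```python
import math


def a_entry(i, j, alpha, big_a):
    """Entry a_ij of the base matrix (0-based indices)."""
    return 1.0 if i == j else 1.0 / (big_a * abs(i - j) ** (alpha + 1))


def witness_ratio(tau, tau2, alpha, big_a):
    """||(A(tau) - A(tau')) v|| / ||v|| for v = (0_k, 1_k, 0_{p-2k})."""
    k = len(tau)
    # Row r of A_12 is scaled by tau_r - tau'_r; v selects columns k, ..., 2k-1.
    u = [
        (tau[r] - tau2[r]) * sum(a_entry(r, k + s, alpha, big_a) for s in range(k))
        for r in range(k)
    ]
    return math.sqrt(sum(x * x for x in u)) / math.sqrt(k)


def hamming(tau, tau2):
    """Hamming distance between two 0/1 sequences of equal length."""
    return sum(1 for a, b in zip(tau, tau2) if a != b)


def explicit_lower_bound(tau, tau2, alpha, big_a):
    """sqrt(Hamming / k) * 2^-(alpha + 1) / (big_a * k^alpha)."""
    k = len(tau)
    return math.sqrt(hamming(tau, tau2) / k) * 2 ** -(alpha + 1) / (big_a * k ** alpha)


if __name__ == "__main__":
    tau, tau2 = (1, 0, 1, 1, 0, 0), (0, 0, 1, 0, 1, 0)
    print(witness_ratio(tau, tau2, 1.0, 4.0), explicit_lower_bound(tau, tau2, 1.0, 4.0))
```

The explicit constant comes from bounding each of the $k$ terms $|r - k - s|^{-(\alpha+1)}$ below by $(2k)^{-(\alpha+1)}$.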



Lemma VI.2. Let $1 \le m < k$, $\tau \in \{0, 1\}^k$ and $\tau' := (0, \dots, 0, \tau_m, \dots, \tau_k)$. Then

$$D\left(N(0, A(\tau)) \,\|\, N(0, A(\tau'))\right) \le \frac{A}{(k - m)^{2\alpha}}.$$

Proof: The key is to note that one has the inequality $D(N(0, A(\tau)) \| N(0, A(\tau'))) \le A \|A(\tau) - A(\tau')\|_F^2$, where $\|A\|_F := \sqrt{\sum_{i,j} a_{ij}^2}$ denotes the Frobenius norm. The proof of this assertion can be found in [13, Proof of Lemma 6]. We can now bound

$$\|A(\tau) - A(\tau')\|_F^2 \le 2 \sum_{r=1}^{m-1} \tau_r^2 \sum_{j=1}^{p-k} a_{r, k+j}^2 \le A \sum_{r=1}^{m-1} \sum_{j=1}^{p-k} \frac{1}{|r - k - j|^{2\alpha+2}} \le A \sum_{r=1}^{m-1} \sum_{j=1}^{\infty} \frac{1}{|k - r + j|^{2\alpha+2}} \le A \sum_{r=1}^{m-1} \frac{1}{(k - r)^{2\alpha+1}} \le \frac{A}{(k - m)^{2\alpha}}.$$

The proof is complete. ∎
The Varshamov-Gilbert lemma (see for example [27, Lemma 4.7]) asserts the existence of a subset $W$ of $\{0, 1\}^k$ with $|W| \ge \exp(k/8)$ such that $\Upsilon(\tau, \tau') \ge k/4$ for all $\tau, \tau' \in W$ with $\tau \ne \tau'$. Let $F := \{A(\tau) : \tau \in W\}$. From inequality (11) and Lemma VI.1, we get that

$$R_n(\alpha) \ge \frac{1}{A k^\alpha}\left(1 - \frac{\log 2 + \frac{1}{|W|} \sum_{A \in F} D(P_A \| \bar{P})}{\log |W|}\right), \tag{37}$$

where $P_A$ denotes the $n$-fold product of the $N(0, A)$ probability measure and $\bar{P} := \sum_{A \in F} P_A / |W|$. Now for $1 \le m < k$ and for $t \in \{0, 1\}^{k-m+1}$, let $Q_t$ denote the $n$-fold product of the $N(0, A(0, \dots, 0, t_1, \dots, t_{k-m+1}))$ probability measure. Applying inequality (17), we get that the quantity $\sum_{A \in F} D(P_A \| \bar{P})/|W|$ is bounded from above by

$$(k - m + 1) \log 2 + \max_{A \in F} \min_{t \in \{0, 1\}^{k-m+1}} D(P_A \| Q_t).$$

Now we use Lemma VI.2 to obtain

$$\frac{1}{|W|} \sum_{A \in F} D(P_A \| \bar{P}) \le A \left((k - m) + \frac{n}{(k - m)^{2\alpha}}\right).$$

Using the above in (37), we get

$$R_n(\alpha) \ge \frac{1}{A k^\alpha}\left(1 - \frac{A}{k}\left((k - m) + \frac{n}{(k - m)^{2\alpha}}\right)\right).$$

Note that the above lower bound for $R_n(\alpha)$ depends on $k$ and $m$, which are constrained to satisfy $2k \le p$ and $1 \le m < k$. To get the best lower bound, we need to optimize the right hand side of the above inequality over $k$ and $m$. It should be obvious that in order to prove (35), it is enough to take $k - m = n^{1/(2\alpha+1)}$ and $k = 4A n^{1/(2\alpha+1)}$. The condition $2k \le p$ will be satisfied if $p \ge C n^{1/(2\alpha+1)}$ for a large enough $C$. It is elementary to check that with these choices of $k$ and $m$, inequality (35) is established.

VII. A Packing Number Lower Bound 

In this section, we shall prove that for every $p \in [1, \infty]$, the $\eta$-packing number $N(\eta; p, \Gamma)$ of $\mathcal{K}^d(\Gamma)$ under the $L^p$ metric is at least $\exp\left(c (\eta/\Gamma)^{(1-d)/2}\right)$ for a positive $c$ and sufficiently small $\eta$. This means that there exist at least $\exp\left(c (\eta/\Gamma)^{(1-d)/2}\right)$ sets in $\mathcal{K}^d(\Gamma)$ separated by at least $\eta$ in the $L^p$ metric. This result was needed in the proof of Theorem V.1. Bronshtein [26, Theorem 4 and Remark 1] proved this for $p = \infty$ (the case of the Hausdorff metric).

Theorem VII.1. Fix $p \in [1, \infty]$. There exist positive constants $\eta_0$ and $C$ depending only on $d$ and $p$ such that for every $\eta \le \eta_0$, we have

$$N(\eta; p, \Gamma) \ge \exp\left(C \left(\frac{\Gamma}{\eta}\right)^{(d-1)/2}\right). \tag{38}$$



Proof: Observe that by scaling, it is enough to prove the result for the case $\Gamma = 1$. We loosely follow Bronshtein [26, Proof of Theorem 4]. Fix $\epsilon \in (0, 1)$. For each point $x \in S^{d-1}$, let $S_x$ denote the supporting hyperplane to the unit ball $B$ at $x$ and let $H_x$ be the hyperplane intersecting the sphere that is parallel to $S_x$ and at a distance of $\epsilon$ from $S_x$. Let $H_x^+$ and $H_x^-$ denote the two halfspaces bounded by $H_x$, where we assume that $H_x^+$ contains the origin. Let $T_x := S^{d-1} \cap H_x^-$ and $A_x := B \cap H_x$, where $B$ stands for the unit ball. It can be checked that the (Euclidean) distance between $x$ and every point in $T_x$ (and $A_x$) is less than or equal to $\sqrt{2}\sqrt{\epsilon}$. It follows that if the distance between two points $x$ and $y$ in $S^{d-1}$ is strictly larger than $2\sqrt{2}\sqrt{\epsilon}$, then the sets $T_x$ and $T_y$ are disjoint.

By standard results (see for example [26, Proof of Theorem 4] where it is referred to as Mikhlin's result), there exist positive constants $C_1$, depending only on $d$, and $\epsilon_0$ such that for every $\epsilon \le \epsilon_0$, there exist $N \ge C_1 (\sqrt{\epsilon})^{1-d}$ points $x_1, \dots, x_N$ in $S^{d-1}$ such that the Euclidean distance between $x_i$ and $x_j$ is strictly larger than $2\sqrt{2}\sqrt{\epsilon}$ whenever $i \ne j$. From now on, we assume that $\epsilon \le \epsilon_0$. We then consider a mapping $\Phi : \{0, 1\}^N \to \mathcal{K}^d(1)$, which is defined, for $\tau = (\tau_1, \dots, \tau_N) \in \{0, 1\}^N$, by

$$\Phi(\tau) := B \cap D_1(\tau_1) \cap D_2(\tau_2) \cap \dots \cap D_N(\tau_N),$$

where for $i = 1, \dots, N$,

$$D_i(0) := H_{x_i}^+ \quad \text{and} \quad D_i(1) := B.$$

It must be clear that the Hausdorff distance between $\Phi(\tau)$ and $\Phi(\tau')$ is not less than $\epsilon$ (in fact, it is exactly equal to $\epsilon$) if $\tau \ne \tau'$. Thus, $\{\Phi(\tau) : \tau \in \{0, 1\}^N\}$ is an $\epsilon$-packing set for $\mathcal{K}^d(1)$ under the Hausdorff metric. However, it is not an $\epsilon$-packing set under the $L^p$ metric. Indeed, the $L^p$ distance between $\Phi(\tau)$ and $\Phi(\tau')$ is not necessarily larger than $\epsilon$ for all pairs $(\tau, \tau')$, $\tau \ne \tau'$. The $L^p$ distance between $\Phi(\tau)$ and $\Phi(\tau')$ depends on the Hamming distance $\Upsilon(\tau, \tau') = \sum_i \mathbf{1}\{\tau_i \ne \tau'_i\}$ between $\tau$ and $\tau'$. We make the claim that

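$$\delta_p(\Phi(\tau), \Phi(\tau')) \ge C_2\, \epsilon\, (\sqrt{\epsilon})^{(d-1)/p}\, (\Upsilon(\tau, \tau'))^{1/p}, \tag{39}$$

where $C_2$ depends only on $d$ and $p$. The claim will be proved later. Assuming it is true, we recall the Varshamov-Gilbert lemma from the previous section to assert the existence of a subset $W$ of $\{0, 1\}^N$ with $|W| \ge \exp(N/8)$ such that $\Upsilon(\tau, \tau') \ge N/4$ for all $\tau, \tau' \in W$ with $\tau \ne \tau'$. Because $N \ge C_1 (\sqrt{\epsilon})^{1-d}$, we get from (39) that for all $\tau, \tau' \in W$ with $\tau \ne \tau'$, we have

$$\delta_p(\Phi(\tau), \Phi(\tau')) \ge C_3\, \epsilon, \quad \text{where } C_3 := C_2 \left(\frac{C_1}{4}\right)^{1/p}.$$

Taking $\eta := C_3 \epsilon$, we have obtained, for each $\eta \le \eta_0 := C_3 \epsilon_0$, an $\eta$-packing subset of $\mathcal{K}^d(1)$ with size $M$, where

$$\log M \ge \frac{N}{8} \ge \frac{C_1}{8} \left(\frac{1}{\sqrt{\epsilon}}\right)^{d-1} = C_4 \left(\frac{1}{\sqrt{\eta}}\right)^{d-1}.$$

The constant $C_4$ only depends on $d$ and $p$, thereby proving (38).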
It remains to prove the claim (39). Fix a point $x \in S^{d-1}$ and $\epsilon \in (0, 1)$. We first observe that it is enough to prove that

$$\delta_p(A_x, T_x)^p \ge C_5\, \epsilon^p\, (\sqrt{\epsilon})^{d-1} \tag{40}$$

for a constant $C_5$ depending on just $d$ and $p$, where $A_x$ and $T_x$ are as defined in the beginning of the proof. This is because of the fact that for every $\tau, \tau' \in W$ with $\tau \ne \tau'$, we can write

$$\delta_p(\Phi(\tau), \Phi(\tau'))^p = \sum_{i \in L} \delta_p(A_{x_i}, T_{x_i})^p, \tag{41}$$

where $L := \{1 \le i \le N : \tau_i \ne \tau'_i\}$. The equality (41) is a consequence of the fact that the points $x_1, \dots, x_N$ are chosen so that $T_{x_1}, \dots, T_{x_N}$ are disjoint.

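We shall now prove the inequality (40), which will complete the proof. Let $u_0$ denote the point in $A_x$ that is closest to the origin. Also let $u_1$ be a point in $A_x \cap S^{d-1}$. Let $\alpha$ denote the angle between $u_0$ and $u_1$. Clearly, $\alpha$ does not depend on the choice of $u_1$ and $\cos\alpha = 1 - \epsilon$. Now let $u$ be a fixed unit vector and let $\theta$ be the angle between the vectors $u$ and $u_0$. By elementary geometry, we deduce that

$$h_{T_x}(u) - h_{A_x}(u) = \begin{cases} 1 - \cos(\alpha - \theta) & \text{if } 0 \le \theta \le \alpha, \\ 0 & \text{otherwise.} \end{cases}$$

Because the difference of support functions only depends on the angle $\theta$, we can write, for a constant $C_6$ depending only on $d$, that

$$\delta_p(A_x, T_x)^p = C_6 \int_0^\alpha (1 - \cos(\alpha - \theta))^p \sin^{d-2}\theta \, d\theta.$$

Now suppose $\beta$ is such that $\cos(\alpha - \beta) = 1 - \epsilon/2$. Then from above, we get that

$$\delta_p(A_x, T_x)^p \ge C_6 \int_0^\beta (1 - \cos(\alpha - \theta))^p \sin^{d-2}\theta \, d\theta \ge C_6 \left(\frac{\epsilon}{2}\right)^p \int_0^\beta \sin^{d-2}\theta \, d\theta \ge C_6 \left(\frac{\epsilon}{2}\right)^p \int_0^\beta \sin^{d-2}\theta \cos\theta \, d\theta = \frac{C_6}{d - 1} \left(\frac{\epsilon}{2}\right)^p \sin^{d-1}\beta.$$

We shall show that $\sin\beta \ge \sqrt{\epsilon}/(2\sqrt{2})$, which will prove (40). Recall that $\cos\alpha = 1 - \epsilon$. Thus

$$1 - \frac{\epsilon}{2} = \cos(\alpha - \beta) \le \cos\alpha + \sin\alpha \sin\beta = 1 - \epsilon + \sqrt{1 - (1 - \epsilon)^2}\, \sin\beta \le 1 - \epsilon + \sqrt{2}\sqrt{\epsilon}\, \sin\beta,$$

which when rearranged gives $\sin\beta \ge \sqrt{\epsilon}/(2\sqrt{2})$. The proof is complete. ∎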
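The final trigonometric step can be checked numerically over the whole range $\epsilon \in (0, 1)$. This is a small illustrative sketch with hypothetical function names; it computes $\beta$ from the defining relations $\cos\alpha = 1 - \epsilon$ and $\cos(\alpha - \beta) = 1 - \epsilon/2$ and compares $\sin\beta$ with $\sqrt{\epsilon}/(2\sqrt{2})$.

```python
import math


def sin_beta(eps):
    """sin(beta), where cos(alpha) = 1 - eps and cos(alpha - beta) = 1 - eps/2."""
    alpha = math.acos(1 - eps)
    beta = alpha - math.acos(1 - eps / 2)
    return math.sin(beta)


def claimed_bound(eps):
    """The lower bound sqrt(eps) / (2 * sqrt(2)) on sin(beta)."""
    return math.sqrt(eps) / (2 * math.sqrt(2))


if __name__ == "__main__":
    for eps in (0.001, 0.1, 0.5, 0.999):
        print(eps, sin_beta(eps), claimed_bound(eps))
```

For small $\epsilon$ the two sides behave like $(\sqrt{2} - 1)\sqrt{\epsilon}$ and $\sqrt{\epsilon}/(2\sqrt{2})$ respectively, so the inequality holds with some slack but not by a large factor.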

VIII. Conclusion 

By a simple application of convexity, we proved an inequality relating the minimax risk in multiple hypothesis testing problems to $f$-divergences of the probability measures involved. This inequality is an extension of Fano's inequality. As another corollary, we obtained a sharp inequality between total variation distance and $f$-divergences. We also indicated how to control the quantity $J_f$ which appears in our lower bounds. This leads to important global lower bounds for the minimax risk. Two applications of our bounds are presented. In the first application, we used the bound (22) to prove a new lower bound (which turns out to be rate-optimal) for the minimax risk of estimating a convex body from noisy measurements of the support function in $n$ directions. In the second application, we employed inequalities (11) and (17) to give a different proof of a recent lower bound for covariance matrix estimation due to Cai, Zhang and Zhou [13].